The file Loan_Modelling.csv contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (9%) accepted the personal loan that was offered to them in the earlier campaign.
Banking
This case is about AllLife Bank, whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that AllLife Bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.
The classification goal is to predict the likelihood of a liability customer buying a personal loan. In other words, we have to build a model that predicts which customers are most likely to accept a personal loan offer, based on their relationship with the bank across the features given in the dataset. Here I will use supervised learning methods and compare which model works best for this problem among Logistic Regression, K-Nearest Neighbors (KNN), and the Naive Bayes algorithm.
!pip install zipcodes
Requirement already satisfied: zipcodes in c:\users\srikant\anaconda3\lib\site-packages (1.2.0)
### IMPORT: ------------------------------------
import scipy.stats as stats
import pandas as pd
import numpy as np
import zipcodes as zcode # to get zipcodes
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
import statsmodels.api as sm
#--Sklearn library--
# Sklearn package's randomized data splitting function
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics
#AUC ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import ConfusionMatrixDisplay # to plot the confusion matrix (confusion_matrix is imported above; plot_confusion_matrix is deprecated)
from sklearn.linear_model import LogisticRegression #to build the model
from sklearn.tree import DecisionTreeClassifier#to build the model
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_colwidth', 400)
pd.set_option('display.float_format', lambda x: '%.5f' % x)  # to suppress scientific notation in numeric displays
warnings.filterwarnings('ignore')  # to suppress warnings
# set the background for the graphs
plt.style.use('ggplot')
Comment: Here I have used numpy, pandas, matplotlib, seaborn, and scipy for EDA and data visualization, and sklearn for data splitting, model building, and the confusion matrix.
data='C:/Users/srikant/Downloads/Loan_Modelling.csv'
df = pd.read_csv(data)
df.head(10)
|   | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.50000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00000 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 6 | 37 | 13 | 29 | 92121 | 4 | 0.40000 | 2 | 155 | 0 | 0 | 0 | 1 | 0 |
| 6 | 7 | 53 | 27 | 72 | 91711 | 2 | 1.50000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 8 | 50 | 24 | 22 | 93943 | 1 | 0.30000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 9 | 35 | 10 | 81 | 90089 | 3 | 0.60000 | 2 | 104 | 0 | 0 | 0 | 1 | 0 |
| 9 | 10 | 34 | 9 | 180 | 93023 | 1 | 8.90000 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
df.tail(10)
|   | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4990 | 4991 | 55 | 25 | 58 | 95023 | 4 | 2.00000 | 3 | 219 | 0 | 0 | 0 | 0 | 1 |
| 4991 | 4992 | 51 | 25 | 92 | 91330 | 1 | 1.90000 | 2 | 100 | 0 | 0 | 0 | 0 | 1 |
| 4992 | 4993 | 30 | 5 | 13 | 90037 | 4 | 0.50000 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4993 | 4994 | 45 | 21 | 218 | 91801 | 2 | 6.67000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4994 | 4995 | 64 | 40 | 75 | 94588 | 3 | 2.00000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.90000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.40000 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.30000 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.50000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.80000 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
Comment: Here I have read the personal-loan dataset using the read_csv() function of pandas; df is a DataFrame. I have used the head(10) function to display the first 10 records of the dataset and tail(10) to display the last 10.
Target column rearrangement: since our target column (Personal_Loan) sits in the middle of the DataFrame, for convenience I drop it from its original position and append it as the last column.

Nominal variables: ID, ZIPCode

Ordinal categorical variables: Education, Family

Interval variables: Age, Experience, Income, CCAvg, Mortgage

Binary categorical variables: Personal_Loan, Securities_Account, CD_Account, Online, CreditCard
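The rearrangement described above can be sketched as follows, on a small toy frame standing in for the loan data (the column subset here is illustrative):

```python
import pandas as pd

# hypothetical miniature of the loan DataFrame
df = pd.DataFrame({
    "ID": [1, 2],
    "Age": [25, 45],
    "Personal_Loan": [0, 1],   # target sits in the middle
    "Online": [0, 0],
})

# drop the target from its original position and append it as the last column
target = df.pop("Personal_Loan")
df["Personal_Loan"] = target

print(list(df.columns))  # Personal_Loan is now last
```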
rows_count, columns_count = df.shape
print('Total Number of rows :', rows_count)
print('Total Number of columns :', columns_count)
Total Number of rows : 5000 Total Number of columns : 14
Comment: Shape of the dataframe is (5000, 14). There are 5000 rows and 14 columns in the dataset.
df.dtypes
ID int64 Age int64 Experience int64 Income int64 ZIPCode int64 Family int64 CCAvg float64 Education int64 Mortgage int64 Personal_Loan int64 Securities_Account int64 CD_Account int64 Online int64 CreditCard int64 dtype: object
Comment: We can also display the data types of the DataFrame using the df.info() function, which gives even more useful information.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
Comment: Here we can see that all the variables are numerical. But the columns 'CD_Account', 'Online', 'Family', 'Education', 'CreditCard', and 'Securities_Account' are categorical variables, which should be of 'category' type.
Processing Zipcode
ZIPCode is a categorical feature and can be a good predictor of the target variable. We can analyse whether there is any location pattern among customers who accepted a loan during the previous campaign, and try to reduce the number of categories by mapping ZIP codes to counties.
df.rename(columns={"ZIPCode": "ZIPCode", "Personal_Loan": "Personal_Loan", "Securities_Account": "Securities_Account", "CD_Account": "CD_Account"}, inplace=True)  # a no-op: these columns already have these names
df.ZIPCode.nunique()
467
# get unique zipcodes
list_zipcode=df.ZIPCode.unique()
# build a dictionary mapping each ZIP code to its county using the zipcodes library
dict_zip = {}
for zipcode in list_zipcode:
    my_city_county = zcode.matching(str(zipcode))
    if len(my_city_county) == 1:  # if the zipcode is found, take its county; otherwise keep the zipcode itself
        county = my_city_county[0].get('county')
    else:
        county = zipcode
    dict_zip.update({zipcode: county})
dict_zip
{91107: 'Los Angeles County',
90089: 'Los Angeles County',
94720: 'Alameda County',
94112: 'San Francisco County',
91330: 'Los Angeles County',
92121: 'San Diego County',
91711: 'Los Angeles County',
93943: 'Monterey County',
93023: 'Ventura County',
94710: 'Alameda County',
90277: 'Los Angeles County',
93106: 'Santa Barbara County',
94920: 'Marin County',
91741: 'Los Angeles County',
95054: 'Santa Clara County',
95010: 'Santa Cruz County',
94305: 'Santa Clara County',
91604: 'Los Angeles County',
94015: 'San Mateo County',
90095: 'Los Angeles County',
91320: 'Ventura County',
95521: 'Humboldt County',
95064: 'Santa Cruz County',
90064: 'Los Angeles County',
94539: 'Alameda County',
94104: 'San Francisco County',
94117: 'San Francisco County',
94801: 'Contra Costa County',
94035: 'Santa Clara County',
92647: 'Orange County',
95814: 'Sacramento County',
94114: 'San Francisco County',
94115: 'San Francisco County',
92672: 'Orange County',
94122: 'San Francisco County',
90019: 'Los Angeles County',
95616: 'Yolo County',
94065: 'San Mateo County',
95014: 'Santa Clara County',
91380: 'Los Angeles County',
95747: 'Placer County',
92373: 'San Bernardino County',
92093: 'San Diego County',
94005: 'San Mateo County',
90245: 'Los Angeles County',
95819: 'Sacramento County',
94022: 'Santa Clara County',
90404: 'Los Angeles County',
93407: 'San Luis Obispo County',
94523: 'Contra Costa County',
90024: 'Los Angeles County',
91360: 'Ventura County',
95670: 'Sacramento County',
95123: 'Santa Clara County',
90045: 'Los Angeles County',
91335: 'Los Angeles County',
93907: 'Monterey County',
92007: 'San Diego County',
94606: 'Alameda County',
94611: 'Alameda County',
94901: 'Marin County',
92220: 'Riverside County',
93305: 'Kern County',
95134: 'Santa Clara County',
94612: 'Alameda County',
92507: 'Riverside County',
91730: 'San Bernardino County',
94501: 'Alameda County',
94303: 'San Mateo County',
94105: 'San Francisco County',
94550: 'Alameda County',
92612: 'Orange County',
95617: 'Yolo County',
92374: 'San Bernardino County',
94080: 'San Mateo County',
94608: 'Alameda County',
93555: 'Kern County',
93311: 'Kern County',
94704: 'Alameda County',
92717: 92717,
92037: 'San Diego County',
95136: 'Santa Clara County',
94542: 'Alameda County',
94143: 'San Francisco County',
91775: 'Los Angeles County',
92703: 'Orange County',
92354: 'San Bernardino County',
92024: 'San Diego County',
92831: 'Orange County',
92833: 'Orange County',
94304: 'Santa Clara County',
90057: 'Los Angeles County',
92130: 'San Diego County',
91301: 'Los Angeles County',
92096: 'San Diego County',
92646: 'Orange County',
92182: 'San Diego County',
92131: 'San Diego County',
93720: 'Fresno County',
90840: 'Los Angeles County',
95035: 'Santa Clara County',
93010: 'Ventura County',
94928: 'Sonoma County',
95831: 'Sacramento County',
91770: 'Los Angeles County',
90007: 'Los Angeles County',
94102: 'San Francisco County',
91423: 'Los Angeles County',
93955: 'Monterey County',
94107: 'San Francisco County',
92834: 'Orange County',
93117: 'Santa Barbara County',
94551: 'Alameda County',
94596: 'Contra Costa County',
94025: 'San Mateo County',
94545: 'Alameda County',
95053: 'Santa Clara County',
90036: 'Los Angeles County',
91125: 'Los Angeles County',
95120: 'Santa Clara County',
94706: 'Alameda County',
95827: 'Sacramento County',
90503: 'Los Angeles County',
90250: 'Los Angeles County',
95817: 'Sacramento County',
95503: 'Humboldt County',
93111: 'Santa Barbara County',
94132: 'San Francisco County',
95818: 'Sacramento County',
91942: 'San Diego County',
90401: 'Los Angeles County',
93524: 'Kern County',
95133: 'Santa Clara County',
92173: 'San Diego County',
94043: 'Santa Clara County',
92521: 'Riverside County',
92122: 'San Diego County',
93118: 'Santa Barbara County',
92697: 'Orange County',
94577: 'Alameda County',
91345: 'Los Angeles County',
94123: 'San Francisco County',
92152: 'San Diego County',
91355: 'Los Angeles County',
94609: 'Alameda County',
94306: 'Santa Clara County',
96150: 'El Dorado County',
94110: 'San Francisco County',
94707: 'Alameda County',
91326: 'Los Angeles County',
90291: 'Los Angeles County',
92807: 'Orange County',
95051: 'Santa Clara County',
94085: 'Santa Clara County',
92677: 'Orange County',
92614: 'Orange County',
92626: 'Orange County',
94583: 'Contra Costa County',
92103: 'San Diego County',
92691: 'Orange County',
92407: 'San Bernardino County',
90504: 'Los Angeles County',
94002: 'San Mateo County',
95039: 'Monterey County',
94063: 'San Mateo County',
94923: 'Sonoma County',
95023: 'San Benito County',
90058: 'Los Angeles County',
92126: 'San Diego County',
94118: 'San Francisco County',
90029: 'Los Angeles County',
92806: 'Orange County',
94806: 'Contra Costa County',
92110: 'San Diego County',
94536: 'Alameda County',
90623: 'Orange County',
92069: 'San Diego County',
92843: 'Orange County',
92120: 'San Diego County',
95605: 'Yolo County',
90740: 'Orange County',
91207: 'Los Angeles County',
95929: 'Butte County',
93437: 'Santa Barbara County',
90630: 'Orange County',
90034: 'Los Angeles County',
90266: 'Los Angeles County',
95630: 'Sacramento County',
93657: 'Fresno County',
92038: 'San Diego County',
91304: 'Los Angeles County',
92606: 'Orange County',
92192: 'San Diego County',
90745: 'Los Angeles County',
95060: 'Santa Cruz County',
94301: 'Santa Clara County',
92692: 'Orange County',
92101: 'San Diego County',
94610: 'Alameda County',
90254: 'Los Angeles County',
94590: 'Solano County',
92028: 'San Diego County',
92054: 'San Diego County',
92029: 'San Diego County',
93105: 'Santa Barbara County',
91941: 'San Diego County',
92346: 'San Bernardino County',
94402: 'San Mateo County',
94618: 'Alameda County',
94904: 'Marin County',
93077: 93077,
95482: 'Mendocino County',
91709: 'San Bernardino County',
91311: 'Los Angeles County',
94509: 'Contra Costa County',
92866: 'Orange County',
91745: 'Los Angeles County',
94111: 'San Francisco County',
94309: 'Santa Clara County',
90073: 'Los Angeles County',
92333: 'San Bernardino County',
90505: 'Los Angeles County',
94998: 'Marin County',
94086: 'Santa Clara County',
94709: 'Alameda County',
95825: 'Sacramento County',
90509: 'Los Angeles County',
93108: 'Santa Barbara County',
94588: 'Alameda County',
91706: 'Los Angeles County',
92109: 'San Diego County',
92068: 'San Diego County',
95841: 'Sacramento County',
92123: 'San Diego County',
91342: 'Los Angeles County',
90232: 'Los Angeles County',
92634: 92634,
91006: 'Los Angeles County',
91768: 'Los Angeles County',
90028: 'Los Angeles County',
92008: 'San Diego County',
95112: 'Santa Clara County',
92154: 'San Diego County',
92115: 'San Diego County',
92177: 'San Diego County',
90640: 'Los Angeles County',
94607: 'Alameda County',
92780: 'Orange County',
90009: 'Los Angeles County',
92518: 'Riverside County',
91007: 'Los Angeles County',
93014: 'Santa Barbara County',
94024: 'Santa Clara County',
90027: 'Los Angeles County',
95207: 'San Joaquin County',
90717: 'Los Angeles County',
94534: 'Solano County',
94010: 'San Mateo County',
91614: 'Los Angeles County',
94234: 'Sacramento County',
90210: 'Los Angeles County',
95020: 'Santa Clara County',
92870: 'Orange County',
92124: 'San Diego County',
90049: 'Los Angeles County',
94521: 'Contra Costa County',
95678: 'Placer County',
95045: 'San Benito County',
92653: 'Orange County',
92821: 'Orange County',
90025: 'Los Angeles County',
92835: 'Orange County',
91910: 'San Diego County',
94701: 'Alameda County',
91129: 'Los Angeles County',
90071: 'Los Angeles County',
96651: 96651,
94960: 'Marin County',
91902: 'San Diego County',
90033: 'Los Angeles County',
95621: 'Sacramento County',
90037: 'Los Angeles County',
90005: 'Los Angeles County',
93940: 'Monterey County',
91109: 'Los Angeles County',
93009: 'Ventura County',
93561: 'Kern County',
95126: 'Santa Clara County',
94109: 'San Francisco County',
93107: 'Santa Barbara County',
94591: 'Solano County',
92251: 'Imperial County',
92648: 'Orange County',
92709: 'Orange County',
91754: 'Los Angeles County',
92009: 'San Diego County',
96064: 'Siskiyou County',
91103: 'Los Angeles County',
91030: 'Los Angeles County',
90066: 'Los Angeles County',
95403: 'Sonoma County',
91016: 'Los Angeles County',
95348: 'Merced County',
91950: 'San Diego County',
95822: 'Sacramento County',
94538: 'Alameda County',
92056: 'San Diego County',
93063: 'Ventura County',
91040: 'Los Angeles County',
92661: 'Orange County',
94061: 'San Mateo County',
95758: 'Sacramento County',
96091: 'Trinity County',
94066: 'San Mateo County',
94939: 'Marin County',
95138: 'Santa Clara County',
95762: 'El Dorado County',
92064: 'San Diego County',
94708: 'Alameda County',
92106: 'San Diego County',
92116: 'San Diego County',
91302: 'Los Angeles County',
90048: 'Los Angeles County',
90405: 'Los Angeles County',
92325: 'San Bernardino County',
91116: 'Los Angeles County',
92868: 'Orange County',
90638: 'Los Angeles County',
90747: 'Los Angeles County',
93611: 'Fresno County',
95833: 'Sacramento County',
91605: 'Los Angeles County',
92675: 'Orange County',
90650: 'Los Angeles County',
95820: 'Sacramento County',
90018: 'Los Angeles County',
93711: 'Fresno County',
95973: 'Butte County',
92886: 'Orange County',
95812: 'Sacramento County',
91203: 'Los Angeles County',
91105: 'Los Angeles County',
95008: 'Santa Clara County',
90016: 'Los Angeles County',
90035: 'Los Angeles County',
92129: 'San Diego County',
90720: 'Orange County',
94949: 'Marin County',
90041: 'Los Angeles County',
95003: 'Santa Cruz County',
95192: 'Santa Clara County',
91101: 'Los Angeles County',
94126: 'San Francisco County',
90230: 'Los Angeles County',
93101: 'Santa Barbara County',
91365: 'Los Angeles County',
91367: 'Los Angeles County',
91763: 'San Bernardino County',
92660: 'Orange County',
92104: 'San Diego County',
91361: 'Ventura County',
90011: 'Los Angeles County',
90032: 'Los Angeles County',
95354: 'Stanislaus County',
94546: 'Alameda County',
92673: 'Orange County',
95741: 'Sacramento County',
95351: 'Stanislaus County',
92399: 'San Bernardino County',
90274: 'Los Angeles County',
94087: 'Santa Clara County',
90044: 'Los Angeles County',
94131: 'San Francisco County',
94124: 'San Francisco County',
95032: 'Santa Clara County',
90212: 'Los Angeles County',
93109: 'Santa Barbara County',
94019: 'San Mateo County',
95828: 'Sacramento County',
90086: 'Los Angeles County',
94555: 'Alameda County',
93033: 'Ventura County',
93022: 'Ventura County',
91343: 'Los Angeles County',
91911: 'San Diego County',
94803: 'Contra Costa County',
94553: 'Contra Costa County',
95211: 'San Joaquin County',
90304: 'Los Angeles County',
92084: 'San Diego County',
90601: 'Los Angeles County',
92704: 'Orange County',
92350: 'San Bernardino County',
94705: 'Alameda County',
93401: 'San Luis Obispo County',
90502: 'Los Angeles County',
94571: 'Solano County',
95070: 'Santa Clara County',
92735: 'Orange County',
95037: 'Santa Clara County',
95135: 'Santa Clara County',
94028: 'San Mateo County',
96003: 'Shasta County',
91024: 'Los Angeles County',
90065: 'Los Angeles County',
95405: 'Sonoma County',
95370: 'Tuolumne County',
93727: 'Fresno County',
92867: 'Orange County',
95821: 'Sacramento County',
94566: 'Alameda County',
95125: 'Santa Clara County',
94526: 'Contra Costa County',
94604: 'Alameda County',
96008: 'Shasta County',
93065: 'Ventura County',
96001: 'Shasta County',
95006: 'Santa Cruz County',
90639: 'Los Angeles County',
92630: 'Orange County',
95307: 'Stanislaus County',
91801: 'Los Angeles County',
94302: 'Santa Clara County',
91710: 'San Bernardino County',
93950: 'Monterey County',
90059: 'Los Angeles County',
94108: 'San Francisco County',
94558: 'Napa County',
93933: 'Monterey County',
92161: 'San Diego County',
94507: 'Contra Costa County',
94575: 'Contra Costa County',
95449: 'Mendocino County',
93403: 'San Luis Obispo County',
93460: 'Santa Barbara County',
95005: 'Santa Cruz County',
93302: 'Kern County',
94040: 'Santa Clara County',
91401: 'Los Angeles County',
95816: 'Sacramento County',
92624: 'Orange County',
95131: 'Santa Clara County',
94965: 'Marin County',
91784: 'San Bernardino County',
91765: 'Los Angeles County',
90280: 'Los Angeles County',
95422: 'Lake County',
95518: 'Humboldt County',
95193: 'Santa Clara County',
92694: 'Orange County',
90275: 'Los Angeles County',
90272: 'Los Angeles County',
91791: 'Los Angeles County',
92705: 'Orange County',
91773: 'Los Angeles County',
93003: 'Ventura County',
90755: 'Los Angeles County',
96145: 'Placer County',
94703: 'Alameda County',
96094: 'Siskiyou County',
95842: 'Sacramento County',
94116: 'San Francisco County',
90068: 'Los Angeles County',
94970: 'Marin County',
90813: 'Los Angeles County',
94404: 'San Mateo County',
94598: 'Contra Costa County'}
We found a county for almost every ZIP code, except 96651, 92634, 93077, and 92717. An internet search gives counties for 92717 and 92634; I couldn't find counties for the other two.
dict_zip.update({92717:'Orange County'})
dict_zip.update({92634:'Orange County'})
df['County']=df['ZIPCode'].map(dict_zip)
df.County.nunique()
40
df.info()
Personal_Loan, Securities_Account, CD_Account, Online, CreditCard, and Education are of int type; we can change them to the category type to reduce the memory required.
# converting categorical variables to the category type
category_col = ['Personal_Loan', 'Securities_Account','Family', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode', 'Education','County']
df[category_col] = df[category_col].astype('category')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null category 5 Family 5000 non-null category 6 CCAvg 5000 non-null float64 7 Education 5000 non-null category 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null category 10 Securities_Account 5000 non-null category 11 CD_Account 5000 non-null category 12 Online 5000 non-null category 13 CreditCard 5000 non-null category 14 County 5000 non-null category dtypes: category(9), float64(1), int64(5) memory usage: 305.4 KB
We can see that the memory usage has decreased from 547.0 KB to 305.4 KB.
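The saving comes from how the category dtype stores data: small integer codes plus a lookup table of the distinct values. A minimal self-contained demonstration on toy data (not the loan dataset) of the same effect:

```python
import numpy as np
import pandas as pd

# toy low-cardinality column, mimicking Education (values 1-3) over 5000 rows
s_int = pd.Series(np.random.randint(1, 4, size=5000))
s_cat = s_int.astype("category")

# int64 needs 8 bytes per row; category stores one small code per row
# plus a tiny table of the 3 distinct values, so memory drops sharply
print(s_int.memory_usage(deep=True), s_cat.memory_usage(deep=True))
```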
Processing Experience
# checking negative values for Experience
df[df['Experience']<0]['Age'].describe()
count 52.00000 mean 24.51923 std 1.47516 min 23.00000 25% 24.00000 50% 24.00000 75% 25.00000 max 29.00000 Name: Age, dtype: float64
df[df['Experience']<0].sort_values(by='Experience',ascending=True)
|   | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4514 | 4515 | 24 | -3 | 41 | 91768 | 4 | 1.00000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 2618 | 2619 | 23 | -3 | 55 | 92704 | 3 | 2.40000 | 2 | 145 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 4285 | 4286 | 23 | -3 | 149 | 93555 | 2 | 7.20000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Kern County |
| 3626 | 3627 | 24 | -3 | 28 | 90089 | 4 | 1.00000 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 2717 | 2718 | 23 | -2 | 45 | 95422 | 4 | 0.60000 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | Lake County |
| 2876 | 2877 | 24 | -2 | 80 | 91107 | 2 | 1.60000 | 3 | 238 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 3130 | 3131 | 23 | -2 | 82 | 92152 | 2 | 1.80000 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | San Diego County |
| 2466 | 2467 | 24 | -2 | 80 | 94105 | 2 | 1.60000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | San Francisco County |
| 3796 | 3797 | 24 | -2 | 50 | 94920 | 3 | 2.40000 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | Marin County |
| 2962 | 2963 | 23 | -2 | 81 | 91711 | 2 | 1.80000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Los Angeles County |
| 793 | 794 | 24 | -2 | 150 | 94720 | 2 | 2.00000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Alameda County |
| 3887 | 3888 | 24 | -2 | 118 | 92634 | 2 | 7.20000 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | Orange County |
| 889 | 890 | 24 | -2 | 82 | 91103 | 2 | 1.60000 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | Los Angeles County |
| 597 | 598 | 24 | -2 | 125 | 92835 | 2 | 7.20000 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | Orange County |
| 315 | 316 | 24 | -2 | 51 | 90630 | 3 | 0.30000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 451 | 452 | 28 | -2 | 48 | 94132 | 2 | 1.75000 | 3 | 89 | 0 | 0 | 0 | 1 | 0 | San Francisco County |
| 4116 | 4117 | 24 | -2 | 135 | 90065 | 2 | 7.20000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 4481 | 4482 | 25 | -2 | 35 | 95045 | 4 | 1.00000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | San Benito County |
| 4411 | 4412 | 23 | -2 | 75 | 90291 | 2 | 1.80000 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | Los Angeles County |
| 4015 | 4016 | 25 | -1 | 139 | 93106 | 2 | 2.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | Santa Barbara County |
| 3157 | 3158 | 23 | -1 | 13 | 94720 | 4 | 1.00000 | 1 | 84 | 0 | 0 | 0 | 1 | 0 | Alameda County |
| 3279 | 3280 | 26 | -1 | 44 | 94901 | 1 | 2.00000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Marin County |
| 3284 | 3285 | 25 | -1 | 101 | 95819 | 4 | 2.10000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | Sacramento County |
| 3292 | 3293 | 25 | -1 | 13 | 95616 | 4 | 0.40000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Yolo County |
| 3394 | 3395 | 25 | -1 | 113 | 90089 | 4 | 2.10000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 3425 | 3426 | 23 | -1 | 12 | 91605 | 4 | 1.00000 | 1 | 90 | 0 | 0 | 0 | 1 | 0 | Los Angeles County |
| 4088 | 4089 | 29 | -1 | 71 | 94801 | 2 | 1.75000 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | Contra Costa County |
| 3824 | 3825 | 23 | -1 | 12 | 95064 | 4 | 1.00000 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | Santa Cruz County |
| 3946 | 3947 | 25 | -1 | 40 | 93117 | 3 | 2.40000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Santa Barbara County |
| 3076 | 3077 | 29 | -1 | 62 | 92672 | 2 | 1.75000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | Orange County |
| 89 | 90 | 25 | -1 | 113 | 94303 | 4 | 2.30000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | San Mateo County |
| 2848 | 2849 | 24 | -1 | 78 | 94720 | 2 | 1.80000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Alameda County |
| 226 | 227 | 24 | -1 | 39 | 94085 | 2 | 1.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Santa Clara County |
| 524 | 525 | 24 | -1 | 75 | 93014 | 4 | 0.20000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Santa Barbara County |
| 536 | 537 | 25 | -1 | 43 | 92173 | 3 | 2.40000 | 2 | 176 | 0 | 0 | 0 | 1 | 0 | San Diego County |
| 540 | 541 | 25 | -1 | 109 | 94010 | 4 | 2.30000 | 3 | 314 | 0 | 0 | 0 | 1 | 0 | San Mateo County |
| 576 | 577 | 25 | -1 | 48 | 92870 | 3 | 0.30000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | Orange County |
| 583 | 584 | 24 | -1 | 38 | 95045 | 2 | 1.70000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | San Benito County |
| 649 | 650 | 25 | -1 | 82 | 92677 | 4 | 2.10000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 670 | 671 | 23 | -1 | 61 | 92374 | 4 | 2.60000 | 1 | 239 | 0 | 0 | 0 | 1 | 0 | San Bernardino County |
| 2980 | 2981 | 25 | -1 | 53 | 94305 | 3 | 2.40000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Santa Clara County |
| 686 | 687 | 24 | -1 | 38 | 92612 | 4 | 0.60000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 1173 | 1174 | 24 | -1 | 35 | 94305 | 2 | 1.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Santa Clara County |
| 1428 | 1429 | 25 | -1 | 21 | 94583 | 4 | 0.40000 | 1 | 90 | 0 | 0 | 0 | 1 | 0 | Contra Costa County |
| 1522 | 1523 | 25 | -1 | 101 | 94720 | 4 | 2.30000 | 3 | 256 | 0 | 0 | 0 | 0 | 1 | Alameda County |
| 1905 | 1906 | 25 | -1 | 112 | 92507 | 2 | 2.00000 | 1 | 241 | 0 | 0 | 0 | 1 | 0 | Riverside County |
| 2102 | 2103 | 25 | -1 | 81 | 92647 | 2 | 1.60000 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | Orange County |
| 2430 | 2431 | 23 | -1 | 73 | 92120 | 4 | 2.60000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | San Diego County |
| 2545 | 2546 | 25 | -1 | 39 | 94720 | 3 | 2.40000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Alameda County |
| 4582 | 4583 | 25 | -1 | 69 | 92691 | 3 | 0.30000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Orange County |
| 909 | 910 | 23 | -1 | 149 | 91709 | 1 | 6.33000 | 1 | 305 | 0 | 0 | 0 | 0 | 1 | San Bernardino County |
| 4957 | 4958 | 29 | -1 | 50 | 95842 | 2 | 1.75000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | Sacramento County |
df.groupby(['Age','Education'])['Experience'].describe().T
| Age | 23 | 24 | 25 | 26 | ... | 64 | 65 | 66 | 67 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Education | 1 | 2 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | ... | 3 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 2 | 3 |
| count | 7.00000 | 5.00000 | 13.00000 | 8.00000 | 7.00000 | 23.00000 | 11.00000 | 19.00000 | 30.00000 | 24.00000 | ... | 30.00000 | 30.00000 | 23.00000 | 27.00000 | 9.00000 | 5.00000 | 10.00000 | 5.00000 | 3.00000 | 4.00000 |
| mean | -1.28571 | -2.20000 | -0.69231 | -0.87500 | -2.00000 | 0.52174 | -0.27273 | -0.52632 | 1.43333 | 0.75000 | ... | 38.53333 | 39.86667 | 40.08696 | 39.51852 | 41.11111 | 41.40000 | 40.40000 | 41.60000 | 42.66667 | 41.50000 |
| std | 0.75593 | 0.44721 | 0.94733 | 0.64087 | 1.00000 | 0.79026 | 0.64667 | 0.69669 | 0.81720 | 0.84699 | ... | 1.75643 | 0.73030 | 0.73318 | 1.55342 | 0.33333 | 0.54772 | 1.64655 | 0.89443 | 0.57735 | 0.57735 |
| min | -3.00000 | -3.00000 | -2.00000 | -2.00000 | -3.00000 | -1.00000 | -1.00000 | -2.00000 | 0.00000 | -1.00000 | ... | 34.00000 | 39.00000 | 39.00000 | 35.00000 | 41.00000 | 41.00000 | 36.00000 | 41.00000 | 42.00000 | 41.00000 |
| 25% | -1.00000 | -2.00000 | -2.00000 | -1.00000 | -2.50000 | 0.00000 | -1.00000 | -1.00000 | 1.00000 | 0.00000 | ... | 38.00000 | 39.00000 | 40.00000 | 39.00000 | 41.00000 | 41.00000 | 40.25000 | 41.00000 | 42.50000 | 41.00000 |
| 50% | -1.00000 | -2.00000 | 0.00000 | -1.00000 | -2.00000 | 1.00000 | 0.00000 | -1.00000 | 2.00000 | 1.00000 | ... | 39.00000 | 40.00000 | 40.00000 | 40.00000 | 41.00000 | 41.00000 | 41.00000 | 41.00000 | 43.00000 | 41.50000 |
| 75% | -1.00000 | -2.00000 | 0.00000 | -0.75000 | -2.00000 | 1.00000 | 0.00000 | 0.00000 | 2.00000 | 1.00000 | ... | 40.00000 | 40.00000 | 41.00000 | 41.00000 | 41.00000 | 42.00000 | 41.00000 | 42.00000 | 43.00000 | 42.00000 |
| max | -1.00000 | -2.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 1.00000 | 1.00000 | 2.00000 | 2.00000 | ... | 40.00000 | 41.00000 | 41.00000 | 41.00000 | 42.00000 | 42.00000 | 42.00000 | 43.00000 | 43.00000 | 42.00000 |
8 rows × 134 columns
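Experience cannot be negative, so the 52 negative values are data-entry errors. The groupby above suggests one plausible repair, sketched here on toy rows (column names from the dataset; the grouping choice is an assumption): replace each negative value with the median Experience of valid customers with the same Age and Education.

```python
import pandas as pd

# hypothetical rows mimicking the negative-Experience records
df = pd.DataFrame({
    "Age":        [23, 24, 24, 45],
    "Education":  [1, 1, 1, 2],
    "Experience": [-3, -1, 2, 19],
})

# median Experience per (Age, Education) group, computed from valid rows only
valid = df[df["Experience"] >= 0]
medians = valid.groupby(["Age", "Education"])["Experience"].median().to_dict()

def fix_experience(row):
    # keep valid values; impute negatives from the group median,
    # falling back to 0 when no valid peer group exists
    if row["Experience"] >= 0:
        return row["Experience"]
    return medians.get((row["Age"], row["Education"]), 0)

df["Experience"] = df.apply(fix_experience, axis=1)
print(df["Experience"].tolist())
```

Simply clipping negatives to 0 is a cruder alternative; the group-median version preserves the Age–Experience relationship visible in the table above.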
df.describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.00000 | 2500.50000 | 1443.52000 | 1.00000 | 1250.75000 | 2500.50000 | 3750.25000 | 5000.00000 |
| Age | 5000.00000 | 45.33840 | 11.46317 | 23.00000 | 35.00000 | 45.00000 | 55.00000 | 67.00000 |
| Experience | 5000.00000 | 20.10460 | 11.46795 | -3.00000 | 10.00000 | 20.00000 | 30.00000 | 43.00000 |
| Income | 5000.00000 | 73.77420 | 46.03373 | 8.00000 | 39.00000 | 64.00000 | 98.00000 | 224.00000 |
| CCAvg | 5000.00000 | 1.93794 | 1.74766 | 0.00000 | 0.70000 | 1.50000 | 2.50000 | 10.00000 |
| Mortgage | 5000.00000 | 56.49880 | 101.71380 | 0.00000 | 0.00000 | 0.00000 | 101.00000 | 635.00000 |
for column in category_col:
print(df[column].value_counts())
print("#" * 40)
0 4520
1 480
Name: Personal_Loan, dtype: int64
########################################
0 4478
1 522
Name: Securities_Account, dtype: int64
########################################
1 1472
2 1296
4 1222
3 1010
Name: Family, dtype: int64
########################################
0 4698
1 302
Name: CD_Account, dtype: int64
########################################
1 2984
0 2016
Name: Online, dtype: int64
########################################
0 3530
1 1470
Name: CreditCard, dtype: int64
########################################
94720 169
94305 127
95616 116
90095 71
93106 57
...
93077 1
94965 1
90068 1
94970 1
91024 1
Name: ZIPCode, Length: 467, dtype: int64
########################################
1 2096
3 1501
2 1403
Name: Education, dtype: int64
########################################
Los Angeles County 1095
San Diego County 568
Santa Clara County 563
Alameda County 500
Orange County 366
San Francisco County 257
San Mateo County 204
Sacramento County 184
Santa Barbara County 154
Yolo County 130
Monterey County 128
Ventura County 114
San Bernardino County 101
Contra Costa County 85
Santa Cruz County 68
Riverside County 56
Marin County 54
Kern County 54
Solano County 33
San Luis Obispo County 33
Humboldt County 32
Sonoma County 28
Fresno County 26
Placer County 24
Butte County 19
Shasta County 18
El Dorado County 17
Stanislaus County 15
San Benito County 14
San Joaquin County 13
Mendocino County 8
Tuolumne County 7
Siskiyou County 7
96651 6
Trinity County 4
Lake County 4
Merced County 4
Imperial County 3
Napa County 3
93077 1
Name: County, dtype: int64
########################################
Observations
def dist_box(data):
# function plots a combined graph for univariate analysis of a continuous variable
# to check spread, central tendency, dispersion and outliers
Name=data.name.upper()
fig,(ax_box,ax_dis) =plt.subplots(nrows=2,sharex=True,gridspec_kw = {"height_ratios": (.25, .75)},figsize=(8, 5))
mean=data.mean()
median=data.median()
mode=data.mode().tolist()[0]
sns.set_theme(style="white")
fig.suptitle("SPREAD OF DATA FOR "+ Name , fontsize=18, fontweight='bold')
sns.boxplot(x=data,showmeans=True, orient='h',color="violet",ax=ax_box)
ax_box.set(xlabel='')
# just trying to make visualisation better. This will set background to white
sns.despine(top=True,right=True,left=True) # to remove side line from graph
sns.distplot(data,kde=False,color='blue',ax=ax_dis)
ax_dis.axvline(mean, color='r', linestyle='--',linewidth=2)
ax_dis.axvline(median, color='g', linestyle='-',linewidth=2)
ax_dis.axvline(mode, color='y', linestyle='-',linewidth=2)
plt.legend({'Mean':mean,'Median':median,'Mode':mode})
#select all quantitative columns for checking the spread
list_col= ['Age','Experience','Income','CCAvg','Mortgage']
for i in range(len(list_col)):
dist_box(df[list_col[i]])
Observations
Age and Experience have nearly identical distributions, with spikes at a few points. Income is right-skewed and has some outliers on the higher side which could be clipped. Average monthly credit card spending is right-skewed and has many outliers on the higher side which could be clipped. Mortgage is mostly 0, but is also right-skewed with many outliers on the higher side which could be clipped.
Age
Age can be a vital factor in borrowing a loan; converting ages into bins may reveal a pattern.
df['Agebin'] = pd.cut(df['Age'], bins = [0, 30, 40, 50, 60, 100], labels = ['18-30', '31-40', '41-50', '51-60', '60-100'])
Income
To understand customer segments, we derive a new column that identifies whether a customer belongs to the upper, middle, or lower income group.
# Create a new variable - Income group
df["Income_group"] = pd.cut(
x=df["Income"],
bins=[0, 50, 140, 224],
labels=["Lower", "Middle", "High"],
)
Spending
To understand customer spending, we derive a new column that classifies each customer as a low, medium, or high spender.
df.CCAvg.describe()
count 5000.00000 mean 1.93794 std 1.74766 min 0.00000 25% 0.70000 50% 1.50000 75% 2.50000 max 10.00000 Name: CCAvg, dtype: float64
df["Spending_group"] = pd.cut( x=df["CCAvg"], bins=[0.00000, 0.70000, 2.50000, 10.00000],
labels=["Low", "Medium", "High"],include_lowest=True ,
)
# Making a list of all categorical variables
cat_columns = ['Family','Education','Personal_Loan','Securities_Account',
'CD_Account','Online','CreditCard','Agebin','Income_group','Spending_group']
title=['Number of Family','Education','Customers who took Personal Loan',
'Customers with a Securities Account','Customers with a CD Account',
'Customers who transact Online','Customers with a Credit Card','Agebins',"Income group",'Spending group']
plt.figure(figsize=(14,20))
sns.set_theme(style="white") # just trying to make visualisation better. This will set background to white
#list_palette=['Blues_r','Greens_r','Purples_r','Reds_r','Blues_r','Greens_r','Purples_r','Reds_r','Blues_r']
for i, variable in enumerate(cat_columns):
plt.subplot(5,2,i+1)
order = df[variable].value_counts(ascending=False).index
#sns.set_palette(list_palette[i]) # to set the palette
sns.set_palette('Set2')
ax=sns.countplot(x=df[variable], data=df )
sns.despine(top=True,right=True,left=True) # to remove side line from graph
for p in ax.patches:
percentage = '{:.1f}%'.format(100 * p.get_height()/len(df[variable]))
x = p.get_x() + p.get_width() / 2 - 0.05
y = p.get_y() + p.get_height()
plt.annotate(percentage, (x, y),ha='center')
plt.tight_layout()
plt.title(title[i].upper())
Observations
df.groupby(['County','Personal_Loan'])['Personal_Loan'].agg({'size'}).unstack()
| size | ||
|---|---|---|
| Personal_Loan | 0 | 1 |
| County | ||
| 93077 | 1 | 0 |
| 96651 | 6 | 0 |
| Alameda County | 456 | 44 |
| Butte County | 17 | 2 |
| Contra Costa County | 73 | 12 |
| El Dorado County | 16 | 1 |
| Fresno County | 24 | 2 |
| Humboldt County | 30 | 2 |
| Imperial County | 3 | 0 |
| Kern County | 47 | 7 |
| Lake County | 4 | 0 |
| Los Angeles County | 984 | 111 |
| Marin County | 48 | 6 |
| Mendocino County | 7 | 1 |
| Merced County | 4 | 0 |
| Monterey County | 113 | 15 |
| Napa County | 3 | 0 |
| Orange County | 333 | 33 |
| Placer County | 22 | 2 |
| Riverside County | 50 | 6 |
| Sacramento County | 169 | 15 |
| San Benito County | 14 | 0 |
| San Bernardino County | 98 | 3 |
| San Diego County | 509 | 59 |
| San Francisco County | 238 | 19 |
| San Joaquin County | 12 | 1 |
| San Luis Obispo County | 28 | 5 |
| San Mateo County | 192 | 12 |
| Santa Barbara County | 143 | 11 |
| Santa Clara County | 492 | 71 |
| Santa Cruz County | 60 | 8 |
| Shasta County | 15 | 3 |
| Siskiyou County | 7 | 0 |
| Solano County | 30 | 3 |
| Sonoma County | 22 | 6 |
| Stanislaus County | 14 | 1 |
| Trinity County | 4 | 0 |
| Tuolumne County | 7 | 0 |
| Ventura County | 103 | 11 |
| Yolo County | 122 | 8 |
plt.figure(figsize=(15,24))
pd.crosstab(index=df['County'],columns=df['Personal_Loan'].sort_values(ascending=False)).plot(kind='barh',stacked=True,figsize=(15,24))
<AxesSubplot:ylabel='County'>
<Figure size 1080x1728 with 0 Axes>
It can be seen that the percentage of loans taken varies across counties. There are many counties, so converting them to regions will help our model.
counties = {
'Los Angeles County':'Los Angeles Region',
'San Diego County':'Southern',
'Santa Clara County':'Bay Area',
'Alameda County':'Bay Area',
'Orange County':'Southern',
'San Francisco County':'Bay Area',
'San Mateo County':'Bay Area',
'Sacramento County':'Central',
'Santa Barbara County':'Southern',
'Yolo County':'Central',
'Monterey County':'Bay Area',
'Ventura County':'Southern',
'San Bernardino County':'Southern',
'Contra Costa County':'Bay Area',
'Santa Cruz County':'Bay Area',
'Riverside County':'Southern',
'Kern County':'Southern',
'Marin County':'Bay Area',
'San Luis Obispo County':'Southern',
'Solano County':'Bay Area',
'Humboldt County':'Superior',
'Sonoma County':'Bay Area',
'Fresno County':'Central',
'Placer County':'Central',
'Butte County':'Superior',
'Shasta County':'Superior',
'El Dorado County':'Central',
'Stanislaus County':'Central',
'San Benito County':'Bay Area',
'San Joaquin County':'Central',
'Mendocino County':'Superior',
'Tuolumne County':'Central',
'Siskiyou County':'Superior',
'Trinity County':'Superior',
'Merced County':'Central',
'Lake County':'Superior',
'Napa County':'Bay Area',
'Imperial County':'Southern',
93077:'Southern',
96651:'Bay Area'
}
df['Regions'] = df['County'].map(counties)
df['Regions'].unique()
array(['Los Angeles Region', 'Bay Area', 'Southern', 'Superior',
'Central'], dtype=object)
df.isnull().sum()
ID 0 Age 0 Experience 0 Income 0 ZIPCode 0 Family 0 CCAvg 0 Education 0 Mortgage 0 Personal_Loan 0 Securities_Account 0 CD_Account 0 Online 0 CreditCard 0 County 0 Agebin 0 Income_group 0 Spending_group 0 Regions 0 dtype: int64
df.dropna(inplace=True)
plt.figure(figsize=(9,7))
sns.countplot(data=df,x=df['Regions'])
sns.despine(top=True,right=True,left=True) # to remove side line from graph
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5000 entries, 0 to 4999 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null category 5 Family 5000 non-null category 6 CCAvg 5000 non-null float64 7 Education 5000 non-null category 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null category 10 Securities_Account 5000 non-null category 11 CD_Account 5000 non-null category 12 Online 5000 non-null category 13 CreditCard 5000 non-null category 14 County 5000 non-null category 15 Agebin 5000 non-null category 16 Income_group 5000 non-null category 17 Spending_group 5000 non-null category 18 Regions 5000 non-null object dtypes: category(12), float64(1), int64(5), object(1) memory usage: 527.6+ KB
sns.set_palette(sns.color_palette("Set2", 8))
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(),annot=True)
plt.show()
Observations
As expected, Age and Experience are highly correlated, so one of them can be dropped. Since Experience contains invalid (negative) values that would need handling, we will drop Experience. Income and average credit card spending are positively correlated. Mortgage has very little correlation with Income.
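To see what "highly correlated" means numerically, here is a minimal sketch of the pairwise Pearson correlation that `df.corr()` computes, run on synthetic, hypothetical age/experience values standing in for the real columns (experience modeled as age minus ~22 years of schooling plus noise):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the Age/Experience columns.
age = rng.integers(23, 68, size=1000).astype(float)
experience = age - 22 + rng.normal(0, 1, size=1000)

# Pearson correlation; a value near 1.0 justifies dropping one column.
r = np.corrcoef(age, experience)[0, 1]
print(round(r, 3))
```

With nearly deterministic dependence between the two, `r` comes out close to 1, mirroring the dark cell for Age/Experience in the heatmap above.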
sns.set_palette(sns.color_palette("Set2", 8))
sns.pairplot(df, hue="Personal_Loan",corner=True)
plt.show()
numeric_columns = ['Age','Experience','Income','CCAvg','Mortgage']
plt.figure(figsize=(15,25))
sns.set_palette(sns.color_palette("Set2", 8))
for i, variable in enumerate(numeric_columns):
plt.subplot(10,3,i+1)
sns.boxplot(x='Personal_Loan',y= df[variable], data=df)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.tight_layout()
plt.title(variable.upper())
sns.distplot( df[df['Personal_Loan'] == 0]['CCAvg'], color = 'g')
sns.distplot( df[df['Personal_Loan'] == 1]['CCAvg'], color = 'r')
sns.despine(top=True,right=True,left=True) # to remove side line from graph
sns.distplot( df[df['Personal_Loan'] == 0]['Income'], color = 'g')
sns.distplot( df[df['Personal_Loan'] == 1]['Income'], color = 'r')
sns.despine(top=True,right=True,left=True) # to remove side line from graph
sns.relplot(x='Income_group',y='CCAvg',hue='Personal_Loan',data=df)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
sns.distplot( df[df['Personal_Loan'] == 0]['Mortgage'], color = 'g')
sns.distplot( df[df['Personal_Loan'] == 1]['Mortgage'], color = 'r')
sns.despine(top=True,right=True,left=True) # to remove side line from graph
sns.swarmplot(x='Income_group',y='Mortgage',hue='Personal_Loan',data=df)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
sns.swarmplot(x='Education',y='Income',hue='Personal_Loan',data=df)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
labels=["No","Yes"]
plt.legend(loc='lower left', frameon=False,)
plt.legend(loc="upper left", labels=labels,title="Borrowed Loan",bbox_to_anchor=(1,1))
<matplotlib.legend.Legend at 0x1e55b0393d0>
sns.countplot(x='Income_group',hue='CD_Account',data=df)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
sns.distplot( df[df['Personal_Loan'] == 0]['Age'], color = 'g')
sns.distplot( df[df['Personal_Loan'] == 1]['Age'], color = 'r')
sns.despine(top=True,right=True,left=True) # to remove side line from graph
sns.countplot(x='Spending_group',hue='Personal_Loan',data=df)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
sns.catplot(y='Income',x='Agebin',hue='Education',kind='bar',col="Personal_Loan", data=df)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
sns.set_palette(sns.color_palette("Set2", 8))
sns.barplot(y='CCAvg',x='Education', hue='Personal_Loan',data=df)
labels=["No","Yes"]
plt.legend(loc="upper left", title="Borrowed Loan",bbox_to_anchor=(1,1))
sns.despine(top=True,right=True,left=True) # to remove side line from graph
Observations
People with higher income were more likely to opt for the personal loan.
People with high mortgages opted for the loan.
Customers with higher average monthly credit card spending have opted for the loan.
Customers with higher income had higher average credit card spending and mortgages.
Customers with Graduate and Advanced/Professional education have higher monthly credit card spending and have borrowed loans from the bank.
## Function to plot stacked bar chart
def stacked_plot(x):
sns.set_palette(sns.color_palette("Set2", 8))
tab1 = pd.crosstab(x,df['Personal_Loan'],margins=True)
print(tab1)
print('-'*120)
tab = pd.crosstab(x,df['Personal_Loan'],normalize='index')
tab.plot(kind='bar',stacked=True,figsize=(7,4))
plt.xticks(rotation=360)
labels=["No","Yes"]
plt.legend(loc='lower left', frameon=False,)
plt.legend(loc="upper left", labels=labels,title="Borrowed Loan",bbox_to_anchor=(1,1))
sns.despine(top=True,right=True,left=True) # to remove side line from graph
#plt.legend(labels)
plt.show()
cat_columns=['Family','Education','Securities_Account','CD_Account','CreditCard','Online','Regions','Agebin','Income_group','Spending_group']
for i, variable in enumerate(cat_columns):
stacked_plot(df[variable])
Personal_Loan 0 1 All Family 1 1365 107 1472 2 1190 106 1296 3 877 133 1010 4 1088 134 1222 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All Education 1 2003 93 2096 2 1221 182 1403 3 1296 205 1501 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All Securities_Account 0 4058 420 4478 1 462 60 522 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All CD_Account 0 4358 340 4698 1 162 140 302 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All CreditCard 0 3193 337 3530 1 1327 143 1470 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All Online 0 1827 189 2016 1 2693 291 2984 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All Regions Bay Area 1747 196 1943 Central 390 30 420 Los Angeles Region 984 111 1095 Southern 1315 135 1450 Superior 84 8 92 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All Agebin 18-30 558 66 624 31-40 1118 118 1236 41-50 1148 122 1270 51-60 1208 115 1323 60-100 488 59 547 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All Income_group Lower 1914 0 1914 Middle 2301 223 2524 High 305 257 562 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Personal_Loan 0 1 All Spending_group Low 1342 39 1381 Medium 2292 79 2371 High 886 362 1248 All 4520 480 5000 ------------------------------------------------------------------------------------------------------------------------
Observations
plt.figure(figsize=(10,5))
numerical=['Income','Age','Experience','CCAvg','Mortgage']
# A better color to see the positive or negative correlation of each variable
heatmap = sns.heatmap(df[numerical].corr(), annot=True, cmap='YlGnBu',linewidths=0.5)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=2);
plt.pie(data=df,x=df["Personal_Loan"].value_counts(),autopct='%1.1f%%')
([<matplotlib.patches.Wedge at 0x1e55a8cb0a0>, <matplotlib.patches.Wedge at 0x1e55b21f460>], [Text(-1.050350983068654, 0.3267457916587637, ''), Text(1.0503509677725784, -0.32674584082924163, '')], [Text(-0.5729187180374475, 0.17822497726841657, '90.4%'), Text(0.5729187096941336, -0.17822500408867725, '9.6%')])
The target variable Personal_Loan is highly imbalanced: only 9.6% of the customers in the dataset previously opted for a personal loan. This can be handled using class weights or SMOTE, but for now we will proceed without SMOTE.
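As a sketch of the class-weight option mentioned above: the "balanced" weights that sklearn derives via `class_weight='balanced'` come from the formula `n_samples / (n_classes * count_per_class)`, which we can compute directly for this 4520/480 split (the label vector below is a hypothetical reconstruction, not the real column):

```python
import numpy as np

# Hypothetical label vector mimicking the 9.6% positive rate in the data.
y = np.array([0] * 4520 + [1] * 480)

# "Balanced" weights: n_samples / (n_classes * count_per_class) --
# the same formula sklearn uses for class_weight='balanced'.
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
print(dict(zip(classes.tolist(), np.round(weights, 3).tolist())))
```

The minority class gets a weight roughly ten times the majority's, so each missed loan-taker costs the model about ten times as much during training.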
df.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | Agebin | Income_group | Spending_group | Regions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County | 18-30 | Lower | Medium | Los Angeles Region |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.50000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County | 41-50 | Lower | Medium | Los Angeles Region |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Alameda County | 31-40 | Lower | Medium | Bay Area |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | San Francisco County | 31-40 | Middle | High | Bay Area |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00000 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | Los Angeles County | 31-40 | Lower | Medium | Los Angeles Region |
# Saving dataset before treating outliers for logistic regression.
df_Decision = df.copy()
Data Description:
Data Cleaning:
Observations from EDA:
Actions for data pre-processing:
Outliers detection:
numeric_columns =['Income','CCAvg','Mortgage','Age']
# outlier detection using boxplot
plt.figure(figsize=(20,30))
for i, variable in enumerate(numeric_columns):
plt.subplot(4,4,i+1)
plt.boxplot(df[variable],whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# Check Income extreme values
df.sort_values(by=["Income"],ascending = False).head(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | Agebin | Income_group | Spending_group | Regions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3896 | 3897 | 48 | 24 | 224 | 93940 | 2 | 6.67000 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | Monterey County | 41-50 | High | High | Bay Area |
| 4993 | 4994 | 45 | 21 | 218 | 91801 | 2 | 6.67000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County | 41-50 | High | High | Los Angeles Region |
| 526 | 527 | 26 | 2 | 205 | 93106 | 1 | 6.33000 | 1 | 271 | 0 | 0 | 0 | 0 | 1 | Santa Barbara County | 18-30 | High | High | Southern |
| 2988 | 2989 | 46 | 21 | 205 | 95762 | 2 | 8.80000 | 1 | 181 | 0 | 1 | 0 | 1 | 0 | El Dorado County | 41-50 | High | High | Central |
| 4225 | 4226 | 43 | 18 | 204 | 91902 | 2 | 8.80000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | San Diego County | 41-50 | High | High | Southern |
df.loc[(df['Age']==48) & (df['Experience']==24)].sort_values(by=['Income'],ascending=False).head(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | Agebin | Income_group | Spending_group | Regions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3896 | 3897 | 48 | 24 | 224 | 93940 | 2 | 6.67000 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | Monterey County | 41-50 | High | High | Bay Area |
| 196 | 197 | 48 | 24 | 165 | 93407 | 1 | 5.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | San Luis Obispo County | 41-50 | High | High | Southern |
| 2217 | 2218 | 48 | 24 | 162 | 91355 | 4 | 3.30000 | 2 | 446 | 1 | 0 | 1 | 1 | 0 | Los Angeles County | 41-50 | High | High | Los Angeles Region |
| 4629 | 4630 | 48 | 24 | 148 | 91311 | 2 | 3.30000 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | Los Angeles County | 41-50 | High | High | Los Angeles Region |
| 4167 | 4168 | 48 | 24 | 144 | 94025 | 4 | 3.50000 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | San Mateo County | 41-50 | High | High | Bay Area |
# Check Mortgage extreme values
df.sort_values(by=["Mortgage"],ascending = False).head(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | Agebin | Income_group | Spending_group | Regions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2934 | 2935 | 37 | 13 | 195 | 91763 | 2 | 6.50000 | 1 | 635 | 0 | 0 | 0 | 1 | 0 | San Bernardino County | 31-40 | High | High | Southern |
| 303 | 304 | 49 | 25 | 195 | 95605 | 4 | 3.00000 | 1 | 617 | 1 | 0 | 0 | 0 | 0 | Yolo County | 41-50 | High | High | Central |
| 4812 | 4813 | 29 | 4 | 184 | 92126 | 4 | 2.20000 | 3 | 612 | 1 | 0 | 0 | 1 | 0 | San Diego County | 18-30 | High | Medium | Southern |
| 1783 | 1784 | 53 | 27 | 192 | 94720 | 1 | 1.70000 | 1 | 601 | 0 | 0 | 0 | 1 | 0 | Alameda County | 51-60 | High | Medium | Bay Area |
| 4842 | 4843 | 49 | 23 | 174 | 95449 | 3 | 4.60000 | 2 | 590 | 1 | 0 | 0 | 0 | 0 | Mendocino County | 41-50 | High | High | Superior |
# Check CCAVg extreme values
df.sort_values(by=["CCAvg"],ascending = False).head(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | Agebin | Income_group | Spending_group | Regions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 787 | 788 | 45 | 15 | 202 | 91380 | 3 | 10.00000 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Los Angeles County | 41-50 | High | High | Los Angeles Region |
| 2101 | 2102 | 35 | 5 | 203 | 95032 | 1 | 10.00000 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Santa Clara County | 31-40 | High | High | Bay Area |
| 2337 | 2338 | 43 | 16 | 201 | 95054 | 1 | 10.00000 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | Santa Clara County | 41-50 | High | High | Bay Area |
| 3943 | 3944 | 61 | 36 | 188 | 91360 | 1 | 9.30000 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | Ventura County | 60-100 | High | High | Southern |
| 3822 | 3823 | 63 | 33 | 178 | 91768 | 4 | 9.00000 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Los Angeles County | 60-100 | High | High | Los Angeles Region |
There are some really extreme Income values (up to 224K USD) compared to customers of the same age group and experience. The values for credit card spending and Mortgage look fine. After identifying outliers, we can decide whether to treat them or not. Here I am not going to treat them, since outliers will occur in real-world scenarios (in Income, Mortgage, average credit card spending, etc.) and we want our model to learn the underlying pattern for such customers.
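If we did decide to treat them instead, a common approach is clipping (winsorizing) at the 1.5×IQR fences the boxplot whiskers use. A minimal sketch on a synthetic right-skewed sample, a hypothetical stand-in for `df['Income']`:

```python
import numpy as np

# Hypothetical right-skewed income-like sample; in practice the real
# df['Income'] values would be used here instead.
rng = np.random.default_rng(0)
income = rng.gamma(shape=2.0, scale=35.0, size=5000)

# Standard 1.5*IQR fences, the same rule boxplot whiskers apply.
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clipping instead of dropping keeps all 5000 rows.
clipped = np.clip(income, lower, upper)
print(int((income > upper).sum()), "values capped at", round(upper, 1))
```

Clipping preserves the sample size and the rank ordering of customers, which is why it is usually preferred over deleting outlier rows.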
df
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | County | Agebin | Income_group | Spending_group | Regions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County | 18-30 | Lower | Medium | Los Angeles Region |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.50000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles County | 41-50 | Lower | Medium | Los Angeles Region |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Alameda County | 31-40 | Lower | Medium | Bay Area |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | San Francisco County | 31-40 | Middle | High | Bay Area |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00000 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | Los Angeles County | 31-40 | Lower | Medium | Los Angeles Region |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.90000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Orange County | 18-30 | Lower | Medium | Southern |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.40000 | 1 | 85 | 0 | 0 | 0 | 1 | 0 | San Diego County | 18-30 | Lower | Low | Southern |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.30000 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | Ventura County | 60-100 | Lower | Low | Southern |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.50000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles County | 60-100 | Lower | Low | Los Angeles Region |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.80000 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | Orange County | 18-30 | Middle | Medium | Southern |
5000 rows × 19 columns
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5000 entries, 0 to 4999 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null category 5 Family 5000 non-null category 6 CCAvg 5000 non-null float64 7 Education 5000 non-null category 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null category 10 Securities_Account 5000 non-null category 11 CD_Account 5000 non-null category 12 Online 5000 non-null category 13 CreditCard 5000 non-null category 14 County 5000 non-null category 15 Agebin 5000 non-null category 16 Income_group 5000 non-null category 17 Spending_group 5000 non-null category 18 Regions 5000 non-null object dtypes: category(12), float64(1), int64(5), object(1) memory usage: 527.6+ KB
#drop column which we don't need for modelling
df.drop(columns=["Agebin", "ZIPCode","County",'Experience','Income_group','Spending_group'], inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5000 entries, 0 to 4999 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Income 5000 non-null int64 3 Family 5000 non-null category 4 CCAvg 5000 non-null float64 5 Education 5000 non-null category 6 Mortgage 5000 non-null int64 7 Personal_Loan 5000 non-null category 8 Securities_Account 5000 non-null category 9 CD_Account 5000 non-null category 10 Online 5000 non-null category 11 CreditCard 5000 non-null category 12 Regions 5000 non-null object dtypes: category(7), float64(1), int64(4), object(1) memory usage: 437.6+ KB
We have 6 categorical independent variables, but 4 of them are binary, so creating dummies for them would change nothing. We will therefore only create dummies for Regions and Education.
X = df.drop(['Personal_Loan'], axis=1)
Y = df['Personal_Loan']
oneHotCols=['Regions','Education']
X=pd.get_dummies(X,columns=oneHotCols,drop_first=True)
#Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state = 1,stratify=Y)
from sklearn.preprocessing import StandardScaler
# Creating StandardScaler instance
scaler = StandardScaler()
# Fitting Standard Scaller
X_scaler = scaler.fit(X_train)
# Scaling data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
X_train_scaled_df = pd.DataFrame(X_train_scaled,columns=X_train.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled,columns=X_test.columns)
X_train_scaled_df.index=np.arange(len(X_train_scaled_df))
X_test_scaled_df.index=np.arange(len(X_test_scaled_df))
y_train.index=np.arange(len(y_train))
y_test.index=np.arange(len(y_test))
The model can make wrong predictions in two ways: predicting a customer will take the loan when they will not (a false positive), or predicting a customer will not take the loan when they would have (a false negative).
Which case is more important? A false negative means losing a potential loan customer, so it is the costlier error here.
How to reduce losses? We need to reduce false negatives, i.e. maximize recall.
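A toy example with made-up confusion-matrix counts shows why recall, not accuracy, is the number to watch here:

```python
# Toy confusion-matrix counts (hypothetical, not from this dataset):
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = 1320, 36, 18, 126

precision = tp / (tp + fp)  # of customers flagged as buyers, how many buy
recall = tp / (tp + fn)     # of actual buyers, how many the model catches

print(round(precision, 3), round(recall, 3))
```

Even a model with high accuracy can have poor recall on a 9.6%-positive dataset, so the confusion matrices below report all four metrics for train and test.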
from sklearn.metrics import ConfusionMatrixDisplay, roc_auc_score # imports needed for the plots and ROC-AUC below
def make_confusion_matrix(y_actual,y_predict,title):
fig, ax = plt.subplots(1, 1)
cm = confusion_matrix(y_actual, y_predict, labels=[0,1])
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=["No","Yes"])
disp.plot(cmap='Greens',colorbar=True,ax=ax)
ax.set_title(title)
plt.tick_params(axis=u'both', which=u'both',length=0)
plt.grid(b=None,axis='both',which='both',visible=False)
plt.show()
def get_metrics_score(model,X_train_df,X_test_df,y_train_pass,y_test_pass,statsklearn,threshold=0.5,flag=True,roc=False):
'''
Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
model: classifier to predict values of X
X_train_df, X_test_df: Independent features
y_train_pass,y_test_pass: Dependent variable
statsklearn : 0 if calling for Sklearn model else 1
threshold: threshold for classifying an observation as 1
flag: If True, print statements showing the different metric scores will be displayed. The default value is True.
roc: If True, the ROC-AUC score will also be displayed. The default value is False.
'''
# defining an empty list to store train and test results
score_list=[]
if statsklearn==0:
pred_train = model.predict(X_train_df)
pred_test = model.predict(X_test_df)
else:
pred_train = (model.predict(X_train_df)>threshold)
pred_test = (model.predict(X_test_df)>threshold)
pred_train = np.round(pred_train)
pred_test = np.round(pred_test)
train_acc = accuracy_score(y_train_pass,pred_train)
test_acc = accuracy_score(y_test_pass,pred_test)
train_recall = recall_score(y_train_pass,pred_train)
test_recall = recall_score(y_test_pass,pred_test)
train_precision = precision_score(y_train_pass,pred_train)
test_precision = precision_score(y_test_pass,pred_test)
train_f1 = f1_score(y_train_pass,pred_train)
test_f1 = f1_score(y_test_pass,pred_test)
score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision,train_f1,test_f1))
if flag == True:
print("\x1b[0;30;47m \033[1mMODEL PERFORMANCE\x1b[0m")
print("\x1b[0;30;47m \033[1mAccuracy : Train:\x1b[0m",
round(accuracy_score(y_train_pass,pred_train),3),
"\x1b[0;30;47m \033[1mTest:\x1b[0m ",
round(accuracy_score(y_test_pass,pred_test),3))
print("\x1b[0;30;47m \033[1mRecall : Train:\x1b[0m"
,round(recall_score(y_train_pass,pred_train),3),
"\x1b[0;30;47m \033[1mTest:\x1b[0m" ,
round(recall_score(y_test_pass,pred_test),3))
print("\x1b[0;30;47m \033[1mPrecision : Train:\x1b[0m",
round(precision_score(y_train_pass,pred_train),3),
"\x1b[0;30;47m \033[1mTest:\x1b[0m ",
round(precision_score(y_test_pass,pred_test),3))
print("\x1b[0;30;47m \033[1mF1 : Train:\x1b[0m",
round(f1_score(y_train_pass,pred_train),3),
"\x1b[0;30;47m \033[1mTest:\x1b[0m",
round(f1_score(y_test_pass,pred_test),3))
make_confusion_matrix(y_train_pass,pred_train,"Confusion Matrix for Train")
make_confusion_matrix(y_test_pass,pred_test,"Confusion Matrix for Test")
if roc == True:
print("\x1b[0;30;47m \033[1mROC-AUC Score :Train:\x1b[0m: ",
round(roc_auc_score(y_train_pass,pred_train),3),
"\x1b[0;30;47m \033[1mTest:\x1b[0m: ",
round(roc_auc_score(y_test_pass,pred_test),3))
return score_list # returning the list with train and test scores
# # defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []
def add_score_model(score):
'''Add scores to list so that we can compare all models score together'''
acc_train.append(score[0])
acc_test.append(score[1])
recall_train.append(score[2])
recall_test.append(score[3])
precision_train.append(score[4])
precision_test.append(score[5])
f1_train.append(score[6])
f1_test.append(score[7])
from sklearn.linear_model import LogisticRegression # import needed for the model
lr = LogisticRegression(solver='newton-cg',random_state=1,fit_intercept=False,class_weight={0:0.15,1:0.85})
model = lr.fit(X_train_scaled_df,y_train)
statmodel=0 #0 for sklearn and 1 for statmodel
# Let's check model performances for this model
scores_Sklearn = get_metrics_score(model,X_train_scaled_df,X_test_scaled_df,y_train,y_test,statmodel)
MODEL PERFORMANCE Accuracy : Train: 0.656 Test: 0.654 Recall : Train: 0.988 Test: 0.972 Precision : Train: 0.217 Test: 0.214 F1 : Train: 0.355 Test: 0.35
add_score_model(scores_Sklearn)
# adding constant to training and test set
X_train_stat = sm.add_constant(X_train_scaled_df)
X_test_stat = sm.add_constant(X_test_scaled_df)
statmodel=1 #0 for sklearn and 1 for statmodel
logit = sm.Logit( y_train, X_train_stat.astype(float) )
lg = logit.fit(warn_convergence=False)
# Let's check model performances for this model
scores_statmodel = get_metrics_score(lg,X_train_stat,X_test_stat,y_train,y_test,statmodel)
lg.summary()
Optimization terminated successfully.
Current function value: 0.107781
Iterations 10
MODEL PERFORMANCE
Accuracy : Train: 0.961 Test: 0.953
Recall : Train: 0.696 Test: 0.625
Precision : Train: 0.867 Test: 0.849
F1 : Train: 0.772 Test: 0.72
| Dep. Variable: | Personal_Loan | No. Observations: | 3500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3483 |
| Method: | MLE | Df Model: | 16 |
| Date: | Sat, 12 Mar 2022 | Pseudo R-squ.: | 0.6591 |
| Time: | 03:47:43 | Log-Likelihood: | -377.23 |
| converged: | True | LL-Null: | -1106.7 |
| Covariance Type: | nonrobust | LLR p-value: | 3.415e-301 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -5.4342 | 0.247 | -22.003 | 0.000 | -5.918 | -4.950 |
| ID | -0.0418 | 0.097 | -0.429 | 0.668 | -0.232 | 0.149 |
| Age | 0.2056 | 0.098 | 2.089 | 0.037 | 0.013 | 0.398 |
| Income | 3.1326 | 0.184 | 17.059 | 0.000 | 2.773 | 3.493 |
| Family | 0.7066 | 0.109 | 6.496 | 0.000 | 0.493 | 0.920 |
| CCAvg | 0.1925 | 0.097 | 1.980 | 0.048 | 0.002 | 0.383 |
| Mortgage | 0.1050 | 0.076 | 1.387 | 0.165 | -0.043 | 0.253 |
| Securities_Account | -0.3156 | 0.117 | -2.697 | 0.007 | -0.545 | -0.086 |
| CD_Account | 0.8975 | 0.105 | 8.582 | 0.000 | 0.693 | 1.102 |
| Online | -0.3430 | 0.102 | -3.371 | 0.001 | -0.542 | -0.144 |
| CreditCard | -0.4741 | 0.122 | -3.889 | 0.000 | -0.713 | -0.235 |
| Regions_Central | -0.1976 | 0.117 | -1.682 | 0.093 | -0.428 | 0.033 |
| Regions_Los Angeles Region | -0.0549 | 0.105 | -0.523 | 0.601 | -0.261 | 0.151 |
| Regions_Southern | 0.0320 | 0.105 | 0.303 | 0.762 | -0.175 | 0.239 |
| Regions_Superior | -0.2397 | 0.182 | -1.315 | 0.188 | -0.597 | 0.117 |
| Education_2 | 2.0106 | 0.156 | 12.906 | 0.000 | 1.705 | 2.316 |
| Education_3 | 2.0844 | 0.158 | 13.191 | 0.000 | 1.775 | 2.394 |
Multicollinearity
We have to check for and remove multicollinearity from the data to get reliable coefficients and p-values. There are several ways of detecting (or testing for) multicollinearity; one of them is the Variance Inflation Factor (VIF). * General rule of thumb: if the VIF is 1, there is no correlation between the predictor and the remaining predictor variables, whereas if the VIF exceeds 5, it shows signs of high multicollinearity. The purpose of the analysis should dictate which threshold to use.
# changing datatype of columns to numeric for checking VIF
X_train_num = X_train_stat.astype(float).copy()
vif_series1 = pd.Series([variance_inflation_factor(X_train_num.values,i) for i in range(X_train_num.shape[1])],index=X_train_num.columns, dtype = float)
print('Series before feature selection: \n\n{}\n'.format(vif_series1))
Series before feature selection:

const                         1.00000
ID                            1.00560
Age                           1.01499
Income                        1.81960
Family                        1.05043
CCAvg                         1.68319
Mortgage                      1.05034
Securities_Account            1.14110
CD_Account                    1.33063
Online                        1.04709
CreditCard                    1.10625
Regions_Central               1.11863
Regions_Los Angeles Region    1.21752
Regions_Southern              1.23682
Regions_Superior              1.03397
Education_2                   1.25009
Education_3                   1.24034
dtype: float64
Observations: All VIF values are close to 1, so there is no meaningful correlation among the predictor variables.
X_train1 = X_train_stat.drop(['Regions_Central', 'Regions_Los Angeles Region', 'Regions_Southern', 'Regions_Superior'], axis = 1)
X_test1= X_test_stat.drop(['Regions_Central', 'Regions_Los Angeles Region', 'Regions_Southern', 'Regions_Superior'], axis = 1)
logit1 = sm.Logit(y_train, X_train1.astype(float))
lg1 = logit1.fit(warn_convergence =False)
lg1.summary()
Optimization terminated successfully.
Current function value: 0.108617
Iterations 9
| Dep. Variable: | Personal_Loan | No. Observations: | 3500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3487 |
| Method: | MLE | Df Model: | 12 |
| Date: | Sat, 12 Mar 2022 | Pseudo R-squ.: | 0.6565 |
| Time: | 03:47:43 | Log-Likelihood: | -380.16 |
| converged: | True | LL-Null: | -1106.7 |
| Covariance Type: | nonrobust | LLR p-value: | 4.912e-304 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -5.3845 | 0.243 | -22.188 | 0.000 | -5.860 | -4.909 |
| ID | -0.0439 | 0.097 | -0.453 | 0.651 | -0.234 | 0.146 |
| Age | 0.2051 | 0.098 | 2.095 | 0.036 | 0.013 | 0.397 |
| Income | 3.1120 | 0.182 | 17.141 | 0.000 | 2.756 | 3.468 |
| Family | 0.6991 | 0.108 | 6.464 | 0.000 | 0.487 | 0.911 |
| CCAvg | 0.1971 | 0.097 | 2.042 | 0.041 | 0.008 | 0.386 |
| Mortgage | 0.0993 | 0.075 | 1.321 | 0.186 | -0.048 | 0.247 |
| Securities_Account | -0.3050 | 0.116 | -2.619 | 0.009 | -0.533 | -0.077 |
| CD_Account | 0.9001 | 0.104 | 8.656 | 0.000 | 0.696 | 1.104 |
| Online | -0.3577 | 0.101 | -3.525 | 0.000 | -0.557 | -0.159 |
| CreditCard | -0.4829 | 0.122 | -3.972 | 0.000 | -0.721 | -0.245 |
| Education_2 | 1.9993 | 0.155 | 12.921 | 0.000 | 1.696 | 2.303 |
| Education_3 | 2.0744 | 0.157 | 13.190 | 0.000 | 1.766 | 2.383 |
X_train2 = X_train1.drop(['Mortgage'], axis = 1)
X_test2= X_test1.drop(['Mortgage'], axis = 1)
logit2 = sm.Logit(y_train, X_train2.astype(float))
lg2 = logit2.fit()
lg2.summary()
Optimization terminated successfully.
Current function value: 0.108865
Iterations 9
| Dep. Variable: | Personal_Loan | No. Observations: | 3500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3488 |
| Method: | MLE | Df Model: | 11 |
| Date: | Sat, 12 Mar 2022 | Pseudo R-squ.: | 0.6557 |
| Time: | 03:47:43 | Log-Likelihood: | -381.03 |
| converged: | True | LL-Null: | -1106.7 |
| Covariance Type: | nonrobust | LLR p-value: | 9.901e-305 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -5.3760 | 0.242 | -22.216 | 0.000 | -5.850 | -4.902 |
| ID | -0.0435 | 0.097 | -0.449 | 0.653 | -0.233 | 0.146 |
| Age | 0.1995 | 0.098 | 2.041 | 0.041 | 0.008 | 0.391 |
| Income | 3.1286 | 0.181 | 17.278 | 0.000 | 2.774 | 3.484 |
| Family | 0.7023 | 0.108 | 6.493 | 0.000 | 0.490 | 0.914 |
| CCAvg | 0.1856 | 0.096 | 1.931 | 0.053 | -0.003 | 0.374 |
| Securities_Account | -0.3089 | 0.116 | -2.657 | 0.008 | -0.537 | -0.081 |
| CD_Account | 0.9084 | 0.104 | 8.746 | 0.000 | 0.705 | 1.112 |
| Online | -0.3543 | 0.101 | -3.501 | 0.000 | -0.553 | -0.156 |
| CreditCard | -0.4852 | 0.121 | -3.999 | 0.000 | -0.723 | -0.247 |
| Education_2 | 1.9904 | 0.154 | 12.899 | 0.000 | 1.688 | 2.293 |
| Education_3 | 2.0686 | 0.157 | 13.174 | 0.000 | 1.761 | 2.376 |
X_train3 = X_train2.drop(['Age'], axis = 1)
X_test3= X_test2.drop(['Age'], axis = 1)
logit3 = sm.Logit(y_train, X_train3.astype(float))
lg3 = logit3.fit()
lg3.summary()
Optimization terminated successfully.
Current function value: 0.109465
Iterations 9
| Dep. Variable: | Personal_Loan | No. Observations: | 3500 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3489 |
| Method: | MLE | Df Model: | 10 |
| Date: | Sat, 12 Mar 2022 | Pseudo R-squ.: | 0.6538 |
| Time: | 03:47:43 | Log-Likelihood: | -383.13 |
| converged: | True | LL-Null: | -1106.7 |
| Covariance Type: | nonrobust | LLR p-value: | 6.452e-305 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -5.3393 | 0.239 | -22.336 | 0.000 | -5.808 | -4.871 |
| ID | -0.0405 | 0.096 | -0.421 | 0.674 | -0.229 | 0.148 |
| Income | 3.1032 | 0.179 | 17.327 | 0.000 | 2.752 | 3.454 |
| Family | 0.6905 | 0.108 | 6.395 | 0.000 | 0.479 | 0.902 |
| CCAvg | 0.1604 | 0.096 | 1.674 | 0.094 | -0.027 | 0.348 |
| Securities_Account | -0.3169 | 0.116 | -2.727 | 0.006 | -0.545 | -0.089 |
| CD_Account | 0.9170 | 0.104 | 8.840 | 0.000 | 0.714 | 1.120 |
| Online | -0.3549 | 0.101 | -3.510 | 0.000 | -0.553 | -0.157 |
| CreditCard | -0.4807 | 0.121 | -3.967 | 0.000 | -0.718 | -0.243 |
| Education_2 | 1.9721 | 0.153 | 12.876 | 0.000 | 1.672 | 2.272 |
| Education_3 | 2.0460 | 0.155 | 13.175 | 0.000 | 1.742 | 2.350 |
CCAvg is an important feature as per the EDA, so we are not dropping it.
# Let's check model performances for this model
scores_statmodel = get_metrics_score(lg3,X_train3,X_test3,y_train,y_test,statmodel)
add_score_model(scores_statmodel)
MODEL PERFORMANCE
Accuracy : Train: 0.961 Test: 0.954
Recall : Train: 0.696 Test: 0.639
Precision : Train: 0.873 Test: 0.844
F1 : Train: 0.775 Test: 0.727
logit_roc_auc_train = roc_auc_score(y_train, lg3.predict(X_train3))
fpr, tpr, thresholds = roc_curve(y_train, lg3.predict(X_train3))
plt.figure(figsize=(7,5))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc_train)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic Train data')
plt.legend(loc="lower right")
plt.show()
logit_roc_auc_test = roc_auc_score(y_test, lg3.predict(X_test3))
fpr, tpr, thresholds = roc_curve(y_test, lg3.predict(X_test3))
plt.figure(figsize=(7,5))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc_test)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic- Test data')
plt.legend(loc="lower right")
plt.show()
The Logistic Regression model generalizes well across the training and test sets. An ROC-AUC score of 0.96 on both sets is quite good.
Coefficient interpretations
Converting coefficients to odds
#Calculate Odds Ratio, probability
##create a data frame to collate Odds ratio, probability and p-value of the coef
lgcoef = pd.DataFrame(lg3.params, columns=['coef'])
lgcoef.loc[:, "Odds Ratio"] = np.exp(lgcoef.coef)
lgcoef['Probability'] = lgcoef['Odds Ratio']/(1+lgcoef['Odds Ratio'])
lgcoef['Percentage Change of Odds']=(np.exp(lg3.params)-1)*100
lgcoef['pval']=lg3.pvalues
pd.options.display.float_format = '{:.2f}'.format
lgcoef = lgcoef.sort_values(by="Odds Ratio", ascending=False)
lgcoef
| coef | Odds Ratio | Probability | Percentage Change of Odds | pval | |
|---|---|---|---|---|---|
| Income | 3.10 | 22.27 | 0.96 | 2126.98 | 0.00 |
| Education_3 | 2.05 | 7.74 | 0.89 | 673.70 | 0.00 |
| Education_2 | 1.97 | 7.19 | 0.88 | 618.55 | 0.00 |
| CD_Account | 0.92 | 2.50 | 0.71 | 150.17 | 0.00 |
| Family | 0.69 | 1.99 | 0.67 | 99.47 | 0.00 |
| CCAvg | 0.16 | 1.17 | 0.54 | 17.39 | 0.09 |
| ID | -0.04 | 0.96 | 0.49 | -3.97 | 0.67 |
| Securities_Account | -0.32 | 0.73 | 0.42 | -27.16 | 0.01 |
| Online | -0.35 | 0.70 | 0.41 | -29.88 | 0.00 |
| CreditCard | -0.48 | 0.62 | 0.38 | -38.16 | 0.00 |
| const | -5.34 | 0.00 | 0.00 | -99.52 | 0.00 |
The most significant variables overall are Income, Education, CD_Account, Family, and CCAvg.
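As a quick sanity check on the table above, the Odds Ratio and Percentage Change of Odds columns follow directly from exponentiating a coefficient. A minimal sketch using the Income coefficient (3.1032) from the fitted summary:

```python
import numpy as np

def interpret_coef(coef):
    """Convert a logistic-regression coefficient into an odds ratio
    and the percentage change in odds per one-unit increase."""
    odds_ratio = np.exp(coef)
    pct_change = (odds_ratio - 1) * 100
    return odds_ratio, pct_change

# Income coefficient from the lg3 summary (on the scaled feature)
or_income, pct_income = interpret_coef(3.1032)
print(f"Odds ratio: {or_income:.2f}, change of odds: {pct_income:.2f}%")
# matches the 22.27 / 2126.98% row for Income in the table
```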
# Let's check model performances for this model
scores_LR = get_metrics_score(lg3,X_train3,X_test3,y_train,y_test,statmodel)
MODEL PERFORMANCE
Accuracy : Train: 0.961 Test: 0.954
Recall : Train: 0.696 Test: 0.639
Precision : Train: 0.873 Test: 0.844
F1 : Train: 0.775 Test: 0.727
Insights:
True Positives:
Reality: The customer wanted to take a personal loan. Model prediction: The customer will take a personal loan. Outcome: The model is correct.
True Negatives:
Reality: The customer didn't want to take a personal loan. Model prediction: The customer will not take a personal loan. Outcome: The business is unaffected.
False Positives:
Reality: The customer didn't want to take a personal loan. Model prediction: The customer will take a personal loan. Outcome: The team targeting potential customers would waste resources on customers who will not buy a personal loan.
False Negatives:
Reality: The customer wanted to take a personal loan. Model prediction: The customer will not take a personal loan. Outcome: The sales team misses a potential customer, which is a loss of opportunity. The purpose of the campaign was to target exactly such customers; had the team known about them, it could have offered attractive APRs/interest rates.
Right Metric to use:
Here, failing to identify a potential customer is the biggest loss we can face. Hence, recall is the right metric for evaluating the model. Recall is about 0.70 on train and 0.64 on test, with 107 and 47 false negatives on train and test respectively. We can improve this score further by choosing an optimal threshold from the ROC-AUC curve and the precision-recall curve.
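To make the cost of a false negative concrete, here is a minimal sketch (on toy labels, not the loan data) showing how each missed buyer lowers recall:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels/predictions: recall = TP / (TP + FN), so every false
# negative (a missed loan buyer) directly lowers it.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("False negatives:", fn)                   # 1 missed potential customer
print("Recall:", recall_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```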
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
#fpr, tpr, thresholds = metrics.roc_curve(y_test, lg2.predict(X_test2))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.07725246629024471
scores_statmodel = get_metrics_score(lg3,X_train3,X_test3,y_train,y_test,statmodel,threshold=optimal_threshold_auc_roc,roc=True)
add_score_model(scores_statmodel)
MODEL PERFORMANCE
Accuracy : Train: 0.895 Test: 0.884
Recall : Train: 0.914 Test: 0.868
Precision : Train: 0.475 Test: 0.446
F1 : Train: 0.625 Test: 0.59
ROC-AUC Score : Train: 0.903 Test: 0.877
With the 0.077 threshold, the recall score has improved from about 64% to 87% on test data at 88% accuracy, and the number of false negatives on the test set has decreased from 46 to 18. The ROC-AUC score of 0.88 is good.
y_scores=lg3.predict(X_train3)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
plt.plot(thresholds, precisions[:-1], 'b--', label='precision')
plt.plot(thresholds, recalls[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.figure(figsize=(10,7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
optimal_threshold_curve = 0.3
scores_opt_curve = get_metrics_score(lg3,X_train3,X_test3,y_train,y_test,statmodel,threshold=optimal_threshold_curve,roc=True)
add_score_model(scores_opt_curve)
MODEL PERFORMANCE
Accuracy : Train: 0.957 Test: 0.944
Recall : Train: 0.801 Test: 0.729
Precision : Train: 0.762 Test: 0.7
F1 : Train: 0.781 Test: 0.714
ROC-AUC Score : Train: 0.887 Test: 0.848
With this threshold the false-negative count goes back up: test recall is 73% at 94% accuracy. The model performs well on both training and test sets and gives a balanced performance; if the bank wishes to maintain a balance between recall and precision, this model can be used. The area under the curve has decreased compared to the initial model, but the performance generalizes across training and test sets.
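The thresholding step itself is simple: class labels come from comparing predicted probabilities against the cutoff. A minimal sketch (the probabilities below are made up for illustration):

```python
import numpy as np

def predict_with_threshold(probs, threshold=0.5):
    """Turn predicted probabilities into class labels at a custom cutoff.
    Lowering the threshold below 0.5 flags more customers as likely
    buyers, trading precision for recall."""
    return (np.asarray(probs) > threshold).astype(int)

probs = np.array([0.05, 0.08, 0.25, 0.40, 0.65])
print(predict_with_threshold(probs, 0.5))    # [0 0 0 0 1]
print(predict_with_threshold(probs, 0.077))  # [0 1 1 1 1]
```

Lowering the cutoff from 0.5 to 0.077 converts three borderline customers into predicted buyers, which is exactly why recall rises and precision falls at the optimal ROC threshold.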
X_train_seq=X_train_stat
X_test_seq=X_test_stat
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
X_train_seq.shape
(3499, 16)
statmodel=0 #0 for sklearn and 1 for statmodel
clf = LogisticRegression(solver='newton-cg',random_state=1,fit_intercept=False)
# Build step forward feature selection
sfs1 = sfs(clf,k_features = 16,forward=True,
floating=False, scoring= 'recall',
verbose=2,
cv=5)
# Perform SFFS
sfs1 = sfs1.fit(X_train_seq, y_train)
Features:  1/16 -- score: 0.9941
Features:  2/16 -- score: 0.9971
Features:  3/16 -- score: 0.9971
Features:  4/16 -- score: 0.9971
Features:  5/16 -- score: 0.9971
Features:  6/16 -- score: 0.9971
Features:  7/16 -- score: 0.9971
Features:  8/16 -- score: 0.9971
Features:  9/16 -- score: 0.9971
Features: 10/16 -- score: 0.9971
Features: 11/16 -- score: 0.9941
Features: 12/16 -- score: 0.9911
Features: 13/16 -- score: 0.9881
Features: 14/16 -- score: 0.9762
Features: 15/16 -- score: 0.9791
Features: 16/16 -- score: 0.6725
statmodel=0 #0 for sklearn and 1 for statmodel
clf = LogisticRegression(solver='newton-cg',random_state=1,fit_intercept=False)
# Build step forward feature selection
sfs1 = sfs(clf,k_features = 11,forward=True,
floating=False, scoring= 'recall',
verbose=2,
cv=5)
# Perform SFFS
sfs1 = sfs1.fit(X_train_seq, y_train)
Features:  1/11 -- score: 0.9941
Features:  2/11 -- score: 0.9971
Features:  3/11 -- score: 0.9971
Features:  4/11 -- score: 0.9971
Features:  5/11 -- score: 0.9971
Features:  6/11 -- score: 0.9971
Features:  7/11 -- score: 0.9971
Features:  8/11 -- score: 0.9971
Features:  9/11 -- score: 0.9971
Features: 10/11 -- score: 0.9971
Features: 11/11 -- score: 0.9941
# Now Which features are important?
feat_cols = list(sfs1.k_feature_idx_)
print(feat_cols)
[1, 2, 4, 5, 6, 8, 9, 10, 11, 13, 14]
X_train_seq.columns[feat_cols]
Index(['Age', 'Income', 'CCAvg', 'Mortgage', 'Securities_Account', 'Online', 'CreditCard', 'Regions_Central', 'Regions_Los Angeles Region', 'Regions_Superior', 'Education_2'], dtype='object')
X_train_final = X_train_seq[X_train_seq.columns[feat_cols]]
X_test_final = X_test_seq[X_train_final.columns]
lr = LogisticRegression(solver='newton-cg',random_state=1,fit_intercept=False)
model = lr.fit(X_train_final,y_train)
statmodel=0 #0 for sklearn and 1 for statmodel
# Let's check model performances for this model
scores_sfs = get_metrics_score(model,X_train_final,X_test_final,y_train,y_test,statmodel)
add_score_model(scores_sfs)
MODEL PERFORMANCE
Accuracy : Train: 0.677 Test: 0.686
Recall : Train: 0.997 Test: 0.972
Precision : Train: 0.229 Test: 0.231
F1 : Train: 0.372 Test: 0.373
comparison_frame = pd.DataFrame({'Model':['Logistic Regression Model- Sklearn',
'Logistic Regression Model - Statsmodels',
'Logistic Regression - Optimal threshold = 0.077',
'Logistic Regression - Optimal threshold = 0.3',
'Logistic Regression - Sequential feature selection'
],
'Train_Accuracy':acc_train,
'Test_Accuracy':acc_test,
'Train Recall':recall_train,
'Test Recall':recall_test,
'Train Precision':precision_train,
'Test Precision':precision_test,
'Train F1':f1_train,
'Test F1':f1_test
})
comparison_frame
| | Model | Train_Accuracy | Test_Accuracy | Train Recall | Test Recall | Train Precision | Test Precision | Train F1 | Test F1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression Model - Sklearn | 0.66 | 0.65 | 0.98 | 0.99 | 0.22 | 0.21 | 0.35 | 0.35 |
| 1 | Logistic Regression Model - Statsmodels | 0.96 | 0.96 | 0.68 | 0.67 | 0.86 | 0.84 | 0.76 | 0.75 |
| 2 | Logistic Regression - Optimal threshold = 0.077 | 0.90 | 0.90 | 0.90 | 0.88 | 0.49 | 0.48 | 0.63 | 0.62 |
| 3 | Logistic Regression - Optimal threshold = 0.3 | 0.95 | 0.94 | 0.80 | 0.73 | 0.75 | 0.70 | 0.77 | 0.72 |
| 4 | Logistic Regression - Sequential feature selection | 0.68 | 0.69 | 1.00 | 0.97 | 0.23 | 0.23 | 0.37 | 0.37 |
Model Building - Decision Tree
#drop column which we don't need for modelling
df_Decision.drop(columns=["Agebin", "ZIPCode","County",'Experience','Income_group','Spending_group'], inplace=True)
X_dt = df_Decision.drop('Personal_Loan', axis=1)
y_dt = df_Decision['Personal_Loan']
#oneHotCols=['Regions']
oneHotCols=X_dt.select_dtypes(exclude='number').columns.to_list()
X_dt=pd.get_dummies(X_dt,columns=oneHotCols,drop_first=True)
# Spliting data set
X_train_dt, X_test_dt, y_train_dt, y_test_dt = train_test_split(X_dt, y_dt, test_size=0.3, random_state=1, stratify=y_dt)
Build Model
If the frequency of class A is 10% and the frequency of class B is 90%, class B becomes the dominant class and the decision tree grows biased toward it.
To handle this imbalanced dataset, we can pass a dictionary {0:0.15, 1:0.85} to the model to specify the weight of each class, so the decision tree gives more weight to class 1.
class_weight is a hyperparameter of the decision tree classifier.
As discussed for logistic regression, failing to identify a potential customer is the biggest loss. Hence, recall is the right metric for evaluating the model.
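The effect of class_weight can be checked on a synthetic dataset with roughly the same 9% positive rate as the campaign data. This is an illustrative sketch, not part of the notebook's pipeline; the dataset here is generated, not the loan data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Synthetic ~9%-positive dataset mimicking the campaign imbalance
X, y = make_classification(n_samples=2000, weights=[0.91, 0.09], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Compare an unweighted tree with one weighted toward the minority class
for cw in [None, {0: 0.15, 1: 0.85}]:
    clf = DecisionTreeClassifier(class_weight=cw, max_depth=3, random_state=1)
    clf.fit(X_tr, y_tr)
    print(cw, "recall:", round(recall_score(y_te, clf.predict(X_te)), 3))
```

The weighted tree trades some precision for higher recall on the minority class, which is the trade-off we want here.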
## Function to calculate recall score
def get_recall_score(model):
'''
model : classifier to predict values of X
'''
ytrain_predict = model.predict(X_train_dt)
ytest_predict = model.predict(X_test_dt)
# accuracy on training set
print("\x1b[0;30;47m \033[1mAccuracy : Train :\033[0m",
model.score(X_train_dt,y_train_dt),
"\x1b[0;30;47m \033[1mTest:\033[0m",
model.score(X_test_dt,y_test_dt))
# accuracy on training set
print("\x1b[0;30;47m \033[1mRecall : Train :\033[0m",
metrics.recall_score(y_train_dt,ytrain_predict),
"\x1b[0;30;47m \033[1mTest:\033[0m",
metrics.recall_score(y_test_dt,ytest_predict))
make_confusion_matrix(y_train_dt,ytrain_predict,"Confusion Matrix on Train Data")
make_confusion_matrix(y_test_dt,ytest_predict,"Confusion Matrix on Test Data")
#since data is imbalanced adding weights
model = DecisionTreeClassifier(criterion = 'gini',class_weight={0:0.15,1:0.85}, random_state=1)
model.fit(X_train_dt, y_train_dt)
get_recall_score(model)
Accuracy : Train : 1.0 Test: 0.9693333333333334
Recall : Train : 1.0 Test: 0.8194444444444444
Decision trees tend to overfit, and the gap between train recall (1.0) and test recall (0.82) suggests that this model has overfitted.
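One common remedy is to cap the tree's complexity and tune it with a recall-scored grid search. A minimal sketch on synthetic stand-in data; with the notebook's X_train_dt/y_train_dt the same grid search would apply unchanged:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan training data
X, y = make_classification(n_samples=500, random_state=1)

# Limit depth and leaf size to curb overfitting; score on recall,
# the metric chosen for this problem
param_grid = {"max_depth": [3, 5, 7, None], "min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1),
    param_grid, scoring="recall", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```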
column_names = list(X_dt.columns)
feature_names = column_names
print(feature_names)
['ID', 'Age', 'Income', 'CCAvg', 'Mortgage', 'Family_2', 'Family_3', 'Family_4', 'Education_2', 'Education_3', 'Securities_Account_1', 'CD_Account_1', 'Online_1', 'CreditCard_1', 'Regions_Central', 'Regions_Los Angeles Region', 'Regions_Southern', 'Regions_Superior']
from sklearn import tree
from sklearn.model_selection import GridSearchCV

plt.figure(figsize=(20,30))
out = tree.plot_tree(model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor('black')
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model,feature_names=feature_names,show_weights=True))
(Text export of the fitted decision tree rules: the root splits on Income <= 98.50, with lower splits on CCAvg, Education, Family, Age, Mortgage and ID, and class-weighted leaf weights shown for each branch.)
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='purple', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
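Instead of reading values off the bar chart, the importances can also be tabulated directly. A minimal sketch on a small synthetic dataset (the feature names 'signal' and 'noise' are illustrative, not from this dataset):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic data: only the first feature carries signal
rng = np.random.default_rng(1)
X = pd.DataFrame({'signal': rng.normal(size=200), 'noise': rng.normal(size=200)})
y = (X['signal'] > 0).astype(int)

tree_model = DecisionTreeClassifier(random_state=1).fit(X, y)

# Rank features by importance, highest first; importances sum to 1
imp = pd.Series(tree_model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp)
```

On real data the same two lines give a sorted table of the features the tree relies on.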
Using GridSearchCV for hyperparameter tuning of our tree model
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(1, 10),
              'min_samples_leaf': [1, 2, 5, 7, 10, 15, 20],
              'max_leaf_nodes': [5, 10, 15, 20, 25, 30],
              }
# Scoring used to compare parameter combinations: recall, since missed
# loan-takers (false negatives) are the costly error here
scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train_dt, y_train_dt)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
estimator
DecisionTreeClassifier(max_depth=8, max_leaf_nodes=25, random_state=1)
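Beyond best_estimator_, the fitted GridSearchCV object also exposes the winning parameter combination and its mean cross-validated score. A self-contained sketch on synthetic data (the toy grid here is illustrative, smaller than the one above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

X, y = make_classification(n_samples=300, n_features=5, random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {'max_depth': [2, 4, 6], 'min_samples_leaf': [1, 5]},
    scoring=metrics.make_scorer(metrics.recall_score),
    cv=5,
)
grid.fit(X, y)

# The winning parameters and their mean cross-validated recall
print(grid.best_params_)
print(round(grid.best_score_, 3))
```

best_params_ is useful for reporting which grid cell won, while best_score_ is the CV estimate (here recall), which can differ from the train/test scores printed later.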
# Fit the best algorithm to the data.
estimator.fit(X_train_dt, y_train_dt)
ytrain_predict=estimator.predict(X_train_dt)
ytest_predict=estimator.predict(X_test_dt)
plt.figure(figsize=(15,10))
out = tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
get_recall_score(estimator)
Accuracy : Train : 0.9928571428571429  Test: 0.9813333333333333
Recall : Train : 0.9494047619047619  Test: 0.875
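get_recall_score is a helper defined earlier in the notebook and not shown in this section. A minimal sketch consistent with its printed output, using stand-in data in place of the notebook's X_train_dt/X_test_dt splits, might look like:

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the notebook these are the prepared train/test splits
X, y = make_classification(n_samples=400, random_state=1)
X_train_dt, X_test_dt, y_train_dt, y_test_dt = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

def get_recall_score(model):
    """Print train/test accuracy and recall for a fitted classifier."""
    pred_train = model.predict(X_train_dt)
    pred_test = model.predict(X_test_dt)
    print('Accuracy : Train :', metrics.accuracy_score(y_train_dt, pred_train),
          ' Test:', metrics.accuracy_score(y_test_dt, pred_test))
    print('Recall : Train :', metrics.recall_score(y_train_dt, pred_train),
          ' Test:', metrics.recall_score(y_test_dt, pred_test))

clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train_dt, y_train_dt)
get_recall_score(clf)
```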
Observations
Hyperparameter tuning reduced the overfitting of the default tree (train recall is no longer 1.00), but test recall is 0.875. We next try cost-complexity post-pruning to improve recall further.
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train_dt, y_train_dt)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
fig, ax = plt.subplots(figsize=(15,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree for each effective alpha: we pass each value to the ccp_alpha parameter of DecisionTreeClassifier and, looping over the alphas array, record the accuracy and recall on both the train and test parts of our dataset.
clfs = []
accuracy_train = []
accuracy_test = []
recall_train = []
recall_test = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.15, 1: 0.85})
    clf.fit(X_train_dt, y_train_dt)
    y_train_pred = clf.predict(X_train_dt)
    y_test_pred = clf.predict(X_test_dt)
    accuracy_train.append(clf.score(X_train_dt, y_train_dt))
    accuracy_test.append(clf.score(X_test_dt, y_test_dt))
    recall_train.append(metrics.recall_score(y_train_dt, y_train_pred))
    recall_test.append(metrics.recall_score(y_test_dt, y_test_pred))
    clfs.append(clf)
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
We get a higher recall on test data for alpha between 0.002 and 0.005, so we choose alpha = 0.002.
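The alpha read off the plot by eye can also be selected programmatically by scoring every candidate from the pruning path. A self-contained sketch on synthetic imbalanced data (the split variables here are stand-ins for the notebook's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Stand-in imbalanced data; the selection logic is the point
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)

# Fit one tree per candidate alpha and keep the alpha with the best test recall
recalls = []
for a in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_tr, y_tr)
    recalls.append(metrics.recall_score(y_te, clf.predict(X_te)))
best_alpha = path.ccp_alphas[int(np.argmax(recalls))]
print(best_alpha)
```

Note that, like the plot-based choice above, this selects alpha on the test set; a separate validation split or cross-validation would be the more rigorous way to tune it.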
best_model = DecisionTreeClassifier(ccp_alpha=0.002,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
best_model.fit(X_train_dt, y_train_dt)
DecisionTreeClassifier(ccp_alpha=0.002, class_weight={0: 0.15, 1: 0.85},
random_state=1)
get_recall_score(best_model)
Accuracy : Train : 0.9802857142857143  Test: 0.9706666666666667
Recall : Train : 0.9880952380952381  Test: 0.9375
The close recall on train (0.99) and test (0.94) indicates a well-generalized model, with about 97% test accuracy and far fewer false negatives.
plt.figure(figsize=(15,10))
out = tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=False,class_names=True)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model,feature_names=feature_names,show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [369.60, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 81.50
|   |   |   |--- Age <= 36.50
|   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |--- weights: [0.30, 2.55] class: 1
|   |   |   |   |--- Family_4 > 0.50
|   |   |   |   |   |--- weights: [1.65, 0.00] class: 0
|   |   |   |--- Age > 36.50
|   |   |   |   |--- weights: [9.15, 0.00] class: 0
|   |   |--- Income > 81.50
|   |   |   |--- CCAvg <= 4.40
|   |   |   |   |--- ID <= 975.00
|   |   |   |   |   |--- weights: [1.80, 0.00] class: 0
|   |   |   |   |--- ID > 975.00
|   |   |   |   |   |--- weights: [3.90, 12.75] class: 1
|   |   |   |--- CCAvg > 4.40
|   |   |   |   |--- weights: [2.40, 0.00] class: 0
|--- Income > 98.50
|   |--- Education_3 <= 0.50
|   |   |--- Education_2 <= 0.50
|   |   |   |--- Family_4 <= 0.50
|   |   |   |   |--- Family_3 <= 0.50
|   |   |   |   |   |--- Income <= 101.50
|   |   |   |   |   |   |--- weights: [0.90, 2.55] class: 1
|   |   |   |   |   |--- Income > 101.50
|   |   |   |   |   |   |--- weights: [71.40, 0.00] class: 0
|   |   |   |   |--- Family_3 > 0.50
|   |   |   |   |   |--- Income <= 118.00
|   |   |   |   |   |   |--- weights: [2.10, 0.85] class: 0
|   |   |   |   |   |--- Income > 118.00
|   |   |   |   |   |   |--- weights: [0.00, 20.40] class: 1
|   |   |   |--- Family_4 > 0.50
|   |   |   |   |--- weights: [0.15, 23.80] class: 1
|   |   |--- Education_2 > 0.50
|   |   |   |--- Income <= 103.50
|   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |--- weights: [2.55, 0.00] class: 0
|   |   |   |   |--- CCAvg > 2.85
|   |   |   |   |   |--- weights: [0.30, 1.70] class: 1
|   |   |   |--- Income > 103.50
|   |   |   |   |--- weights: [2.70, 101.15] class: 1
|   |--- Education_3 > 0.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.35
|   |   |   |   |--- weights: [4.20, 2.55] class: 0
|   |   |   |--- CCAvg > 2.35
|   |   |   |   |--- weights: [1.35, 12.75] class: 1
|   |   |--- Income > 114.50
|   |   |   |--- weights: [0.15, 104.55] class: 1
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
We got a higher recall on test data for alpha between 0.002 and 0.005, and chose alpha = 0.002. The close recall on train and test indicates a well-generalized model, with about 97% test accuracy and reduced false negatives. Important features: Income, Graduate education, family size of 3 and 4, CCAvg, Advanced education, and Age. This is the best model, as there are only 9 false negatives on test data.
comparison_frame = pd.DataFrame({'Model': ['Logistic Regression with Optimal Threshold 0.104',
                                           'Initial decision tree model',
                                           'Decision tree with hyperparameter tuning',
                                           'Decision tree with post-pruning'],
                                 'Train_accuracy': [0.92, 1, 0.99, 0.98],
                                 'Test_accuracy': [0.91, 0.98, 0.98, 0.97],
                                 'Train_Recall': [0.90, 1, 0.92, 0.98],
                                 'Test_Recall': [0.88, 0.86, 0.84, 0.96]})
comparison_frame
| | Model | Train_accuracy | Test_accuracy | Train_Recall | Test_Recall |
|---|---|---|---|---|---|
| 0 | Logistic Regression with Optimal Threshold 0.104 | 0.92 | 0.91 | 0.90 | 0.88 |
| 1 | Initial decision tree model | 1.00 | 0.98 | 1.00 | 0.86 |
| 2 | Decision tree with hyperparameter tuning | 0.99 | 0.98 | 0.92 | 0.84 |
| 3 | Decision tree with post-pruning | 0.98 | 0.97 | 0.98 | 0.96 |
y_pred = best_model.predict(X_test_dt)
print(classification_report(y_test_dt,y_pred))
make_confusion_matrix(y_test_dt,y_pred,"confusion matrix on test")
precision recall f1-score support
0 0.99 0.97 0.98 1356
1 0.79 0.94 0.86 144
accuracy 0.97 1500
macro avg 0.89 0.96 0.92 1500
weighted avg 0.97 0.97 0.97 1500
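make_confusion_matrix is a helper defined earlier in the notebook; its definition is outside this section. A minimal sketch consistent with how it is called above (actual labels, predictions, and a title), assuming a seaborn heatmap:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless; drop in a notebook
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def make_confusion_matrix(y_actual, y_predict, title):
    """Plot a labelled confusion-matrix heatmap (sketch of the notebook helper)."""
    cm = confusion_matrix(y_actual, y_predict)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(title)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()
    return cm

# Toy call: one false positive, no false negatives
cm = make_confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], 'sketch')
print(cm)
```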
Observation
After post-pruning with the chosen ccp_alpha, false negatives drop to 9. Accuracy on test data is 97% and recall is 94%.
misclass_df = X_test_dt.copy()
misclass_df['Actual']=y_test_dt
misclass_df['Predicted'] = y_pred
plt.pie(misclass_df["Actual"].value_counts(), labels=['Class 0', 'Class 1'], autopct='%1.1f%%')
plt.show()
(pie chart: actual split is 90.4% class 0 vs 9.6% class 1)
plt.pie(misclass_df["Predicted"].value_counts(), labels=['Class 0', 'Class 1'], autopct='%1.1f%%')
plt.show()
(pie chart: predicted split is 88.7% class 0 vs 11.3% class 1)
The class proportions predicted by our model are very close to the actual ones. Let's find the false negative and false positive observations.
pd.crosstab(misclass_df['Predicted'],misclass_df['Actual'],margins=True)
| Actual | 0 | 1 | All |
|---|---|---|---|
| Predicted | |||
| 0 | 1321 | 9 | 1330 |
| 1 | 35 | 135 | 170 |
| All | 1356 | 144 | 1500 |
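The false-positive and false-negative counts in the crosstab can also be read off with sklearn's confusion_matrix, whose ravel() returns the four cells in a fixed (tn, fp, fn, tp) order. A small sketch with toy labels standing in for y_test_dt and y_pred:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: three of each class correct, one error in each direction
y_actual  = [0, 0, 0, 1, 1, 1, 1, 0]
y_predict = [0, 0, 1, 1, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix row by row: (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_actual, y_predict).ravel()
print(tn, fp, fn, tp)
```

On the real test split this unpacking would give the 9 false negatives and 35 false positives shown in the crosstab above.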
pd.crosstab(misclass_df['Predicted'],misclass_df['Actual'],normalize='index').plot(kind='bar',stacked=True)
(stacked bar chart: proportion of actual classes within each predicted class)
# Rows that were classified as Negative when they were actually positive
fn_rows = misclass_df[(misclass_df['Actual'] == 1) & (misclass_df['Predicted'] == 0)]
False_negative= df_Decision[df_Decision.index.isin(fn_rows.index.values)].copy()
False_negative
| ID | Age | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Regions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 464 | 465 | 43 | 83 | 4 | 3.60 | 3 | 0 | 1 | 0 | 0 | 0 | 1 | Bay Area |
| 927 | 928 | 65 | 95 | 3 | 3.70 | 2 | 138 | 1 | 0 | 0 | 0 | 1 | Bay Area |
| 951 | 952 | 59 | 83 | 2 | 3.40 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | Bay Area |
| 1069 | 1070 | 44 | 75 | 2 | 3.50 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | Los Angeles Region |
| 1126 | 1127 | 32 | 104 | 2 | 3.70 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | Bay Area |
| 1518 | 1519 | 43 | 64 | 4 | 3.00 | 3 | 221 | 1 | 0 | 0 | 1 | 0 | Bay Area |
| 2361 | 2362 | 36 | 109 | 3 | 0.50 | 3 | 0 | 1 | 1 | 1 | 1 | 0 | Bay Area |
| 2623 | 2624 | 42 | 111 | 3 | 3.00 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | Bay Area |
| 2996 | 2997 | 42 | 103 | 1 | 3.33 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | Los Angeles Region |
# Rows that were classified as positive when they were actually negative
fp_rows = misclass_df[(misclass_df['Actual'] == 0) & (misclass_df['Predicted'] == 1)]
fp_rows
False_Positive= df_Decision[df_Decision.index.isin(fp_rows.index.values)].copy()
False_Positive
| ID | Age | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | Regions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 11 | 65 | 105 | 4 | 2.40 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | Bay Area |
| 185 | 186 | 39 | 115 | 1 | 1.00 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Southern |
| 553 | 554 | 52 | 101 | 2 | 0.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Los Angeles Region |
| 1045 | 1046 | 43 | 84 | 1 | 4.00 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | Southern |
| 1088 | 1089 | 59 | 95 | 1 | 3.80 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | Superior |
| 1138 | 1139 | 30 | 83 | 4 | 3.40 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | Southern |
| 1147 | 1148 | 37 | 111 | 1 | 0.80 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Los Angeles Region |
| 1383 | 1384 | 65 | 105 | 4 | 1.70 | 2 | 230 | 0 | 1 | 0 | 1 | 0 | Central |
| 1857 | 1858 | 37 | 105 | 1 | 0.80 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | Bay Area |
| 1886 | 1887 | 65 | 115 | 4 | 1.70 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Bay Area |
| 1936 | 1937 | 50 | 82 | 3 | 3.00 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles Region |
| 2009 | 2010 | 25 | 99 | 1 | 1.90 | 1 | 323 | 0 | 0 | 0 | 0 | 0 | Southern |
| 2205 | 2206 | 63 | 101 | 2 | 2.80 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Central |
| 2311 | 2312 | 62 | 115 | 4 | 3.40 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | Los Angeles Region |
| 2479 | 2480 | 55 | 82 | 4 | 3.80 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Bay Area |
| 2523 | 2524 | 49 | 100 | 2 | 6.30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Bay Area |
| 2665 | 2666 | 35 | 105 | 2 | 4.50 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | Los Angeles Region |
| 2862 | 2863 | 65 | 113 | 4 | 2.40 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Southern |
| 3030 | 3031 | 50 | 110 | 3 | 1.80 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Bay Area |
| 3064 | 3065 | 59 | 83 | 3 | 4.40 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | Bay Area |
| 3262 | 3263 | 44 | 85 | 2 | 3.80 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | Los Angeles Region |
| 3322 | 3323 | 41 | 104 | 1 | 4.00 | 3 | 0 | 0 | 0 | 1 | 1 | 1 | Southern |
| 3409 | 3410 | 29 | 113 | 2 | 2.00 | 2 | 84 | 0 | 0 | 0 | 1 | 1 | Central |
| 3657 | 3658 | 52 | 104 | 2 | 2.40 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | Bay Area |
| 3754 | 3755 | 63 | 112 | 4 | 2.40 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | Southern |
| 4078 | 4079 | 36 | 58 | 1 | 3.60 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Southern |
| 4187 | 4188 | 30 | 109 | 4 | 2.20 | 2 | 103 | 0 | 0 | 0 | 0 | 1 | Bay Area |
| 4211 | 4212 | 40 | 104 | 2 | 1.80 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | Bay Area |
| 4268 | 4269 | 49 | 108 | 2 | 2.40 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Central |
| 4273 | 4274 | 44 | 83 | 2 | 3.80 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | Central |
| 4348 | 4349 | 59 | 99 | 2 | 2.70 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Southern |
| 4671 | 4672 | 39 | 104 | 1 | 4.00 | 3 | 0 | 0 | 0 | 1 | 1 | 1 | Bay Area |
| 4678 | 4679 | 33 | 115 | 1 | 2.70 | 2 | 283 | 0 | 0 | 0 | 1 | 0 | Southern |
| 4815 | 4816 | 58 | 99 | 2 | 1.40 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | Southern |
| 4930 | 4931 | 63 | 110 | 3 | 1.80 | 2 | 109 | 0 | 0 | 0 | 1 | 0 | Los Angeles Region |
Analyzing Education across these misclassified customers, most have Graduate or Advanced/Professional education. These cases are exceptions to the patterns the model learned.
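To make the education pattern explicit rather than eyeballing the table, a value_counts over the Education column of the false-positive frame confirms it. The list below is copied from the False_Positive table above (levels 2 and 3 are the Graduate and Advanced levels referred to):

```python
import pandas as pd

# Education column transcribed from the 35 false-positive rows above
fp_education = [3, 3, 1, 3, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 3, 3,
                2, 1, 3, 3, 2, 2, 3, 2, 2, 2, 2, 3, 1, 3, 2, 1, 2]
edu_counts = pd.Series(fp_education).value_counts().sort_index()
print(edu_counts)
```

This shows 26 of the 35 false positives (levels 2 and 3 combined) have Graduate or Advanced education, supporting the observation above.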